Chapter 1
Introduction


1.1 The Statistical Modeling Cycle 


We consider statistical modeling of insurance problems. This comprises the process 
of data collection, data analysis and statistical model building to forecast insured 
events that (may) happen in the future. This problem is at the very heart of statistics 
and statistical modeling. Our goal here is to present and provide the statistical tools
that are useful in daily actuarial practice. In particular, we aim to describe the
mathematical foundation behind these statistical concepts and how they can be
applied. Statistical modeling has a wide range of applications, and, depending on
the application, the theoretical aspects may be weighted differently. In insurance 
pricing we are mainly interested in optimal predictions, whereas economists often 
use statistical tools to explain observations, and in medical fields one is interested 
in causal effects that medications have on patients. Therefore, statistical theory is 
wide ranging, and one should always keep the corresponding application in mind. 
Shmueli [338] nicely discusses the difference between prediction and explanation; 
our focus here is mainly on prediction. 

Box–Jenkins [49] and McCullagh–Nelder [265] distinguish three processes in
statistical modeling: (i) model identification/selection, (ii) estimation, and (iii)
prediction. In our statistical modeling cycle these three points are slightly modified 
and extended: 


(1) Data collection, cleaning and pre-processing: 
This item takes at least 80% of the total time in statistical modeling. It includes 
exploratory data analysis, data visualization and data pre-processing. This part 
of the modeling cycle does not seem to be very scientific; however, it is a highly
important step because only extended data analysis allows the modeler to fully
understand the data. Based on this knowledge the modeler can formulate her/his
research question, her/his model, etc.

(2) Selection of a model class:
Based on the knowledge collected in the first item, the modeler has to select a 
suitable model class that is able to answer her/his research question. This model 
class can be in the sense of a data model (proper stochastic model), but it can 
also be an algorithmic model; we refer to the discussion on the “two modeling 
cultures” by Breiman [53]. 

(3) Choice of an objective function: 
Once the modeler has specified a model class, she/he needs to define a decision
rule for how a particular member of the model class is selected for the collected
data. Often this is in terms of an objective function, e.g., a scoring rule or a loss 
function that quantifies misspecification. 

(4) Solving a (non-convex) optimization problem: 
Once the first three items are completed, one is left with an optimization 
problem that tries to find the best model within the selected model class w.r.t. the 
given objective function and the collected data. In simple cases this optimization 
problem is a convex minimization problem for which numerical tools are in 
place. In more complex cases the optimization problem is neither convex nor
concave, and the ‘best’ solution often cannot be found explicitly. In that case,
the meaning of ‘solution’ itself needs to be discussed.

(5) Model validation: 
In the final/next step, the selected and fitted model needs to be validated. That
is, does the model fit the data, does it predict new data well, does
it answer the research question adequately, is there any better model/process
choice, etc.?

(6) Possibly go back to (1): 
If the answers in item (5) are not satisfactory, one typically goes back to (1). 
For instance, data pre-processing needs to be done differently, etc. 
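
The steps (1)-(5) above can be traced in a small, self-contained sketch. Everything in it is an assumption made for illustration: the data are synthetic Poisson claim counts, the model class is the family of Poisson distributions with a constant mean, the objective function is the average Poisson deviance loss, and the optimization is a plain grid search. The names `poisson_sample`, `poisson_deviance` and `lam_hat` are hypothetical and do not come from the text.

```python
import math
import random

random.seed(1)

def poisson_sample(lam):
    """Draw one Poisson(lam) variate by inverting the distribution function."""
    u, k = random.random(), 0
    p = math.exp(-lam)            # probability weight f(0)
    cdf = p
    while u > cdf and k < 100:    # the cap guards against floating-point stalls
        k += 1
        p *= lam / k              # recursion f(k) = f(k-1) * lam / k
        cdf += p
    return k

# (1) Data collection: synthetic claim counts (hypothetical, true mean 0.3)
data = [poisson_sample(0.3) for _ in range(2000)]
train, test = data[:1500], data[1500:]

# (2) Model class: Poisson distributions with a constant mean lam > 0
# (3) Objective function: average Poisson deviance loss
def poisson_deviance(lam, ys):
    total = 0.0
    for y in ys:
        log_term = y * math.log(y / lam) if y > 0 else 0.0
        total += 2.0 * (log_term - y + lam)
    return total / len(ys)

# (4) Optimization: a simple grid search over candidate means
grid = [0.01 * j for j in range(1, 101)]
lam_hat = min(grid, key=lambda lam: poisson_deviance(lam, train))

# (5) Validation: evaluate the loss on data that was not used for fitting
loss_in = poisson_deviance(lam_hat, train)
loss_out = poisson_deviance(lam_hat, test)
print(f"fitted mean {lam_hat:.2f}, in-sample {loss_in:.4f}, out-of-sample {loss_out:.4f}")
```

In this toy setting the optimization problem is convex, so the grid search lands next to the sample mean of the training data; step (6) would be triggered if the out-of-sample loss were judged unsatisfactory.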


Especially, the two modeling cultures discussion of Breiman [53], after the turn 
of the millennium, has shaken up the statistical community. Having predictive 
performance as the main criterion, the data modeling culture has gradually shifted 
to the algorithmic culture, where the model itself plays a secondary role as long 
as the prediction is accurate. The latter is often in the form of a point predictor 
which can come from an algorithm. Lifting this discussion to a more scientific 
level, providing prediction uncertainty will slowly merge the two modeling cultures. 
There is another interesting discussion by Efron [116] on prediction, estimation
(of model parameters) and attribution (predictor selection), which is very much at
the core of statistical modeling. In these notes we want to especially emphasize
the one modeling culture view of Yu–Barter [397] who expect the two modeling
cultures of Breiman [53] to merge much closer than one would expect. Our goal is
to demonstrate how all these different techniques and views can be seen as a unified 
modeling framework. 

Concluding, the purpose of these notes is to discuss and illustrate how the 
different statistical techniques from the data modeling culture and the algorithmic 
modeling culture can be combined to solve actuarial questions in the best possible 
way. The main emphasis in this discussion lies on the statistical modeling tools,
and we present these tools along with actuarial examples. In actuarial practice one
often distinguishes between life and general insurance. This distinction is made for
good reasons. There are legislative reasons that require life insurance business to be
legally separated from general insurance business, but there are also modeling reasons, because insurance
products in life and general insurance can have rather different features. In this book,
we do not make this distinction because the statistical methods presented here can be 
useful in both branches of insurance, and we are going to consider life and general 
insurance examples, e.g., the former considering mortality forecasting and the latter 
aiming at insurance claims prediction for pricing. 


1.2 Preliminaries on Probability Theory 


The modern axiomatic foundation of probability theory was introduced in 1933 by 
the famous mathematician Kolmogoroff [221] in his book called “Grundbegriffe der 
Wahrscheinlichkeitsrechnung”. We give a brief introduction to probability theory 
and random variables; this introduction follows the lecture notes [387]. Throughout 
we assume that we work on a sufficiently rich probability space $(\Omega, \mathcal{A}, \mathbb{P})$, meaning that
this probability space should be able to carry all objects that we study. We denote
(real-valued) random variables on this probability space by capital letters $Y, Z, \ldots$,
and random vectors use boldface capital letters, e.g., we have a random vector
$\boldsymbol{Y} = (Y_1, \ldots, Y_q)^\top$ of dimension $q \in \mathbb{N}$, where each component $Y_k$, $1 \le k \le q$, is a
random variable. Random variables $Y$ are characterized by (cumulative) distribution
functions¹ $F : \mathbb{R} \to [0, 1]$, for $y \in \mathbb{R}$

$$F(y) = \mathbb{P}[Y \le y],$$

being the probability of the event that $Y$ has a realization less than or equal to $y$. We
write $Y \sim F$ for $Y$ having distribution function $F$. Similarly, random vectors $\boldsymbol{Y} \sim F$
are characterized by (cumulative) distribution functions $F : \mathbb{R}^q \to [0, 1]$ with

$$F(\boldsymbol{y}) = \mathbb{P}[Y_1 \le y_1, \ldots, Y_q \le y_q] \qquad \text{for } \boldsymbol{y} = (y_1, \ldots, y_q)^\top \in \mathbb{R}^q.$$


In insurance modeling, there are two important types of random variables, 
namely, discrete random variables and absolutely continuous random variables: 


• The distribution function $F$ of a discrete random variable $Y$ is a step function
with countably many steps in discrete points $k \in \mathfrak{N} \subset \mathbb{R}$. A discrete random
variable has probability weights in these discrete points

$$f(k) = \mathbb{P}[Y = k] > 0 \qquad \text{for } k \in \mathfrak{N},$$

satisfying $\sum_{k \in \mathfrak{N}} f(k) = 1$. If $\mathfrak{N} \subseteq \mathbb{N}_0$, the integer-valued random variable $Y$
is called count random variable. Count random variables are used to model the
number of claims in insurance. A similar situation occurs if $Y$ models nominal
outcomes, for instance, if $Y$ models gender with female being encoded by 0 and
male being encoded by 1, then $f(0)$ is the probability weight of having a female
and $f(1) = 1 - f(0)$ the probability weight of having a male; in this case we
identify the finite set $\mathfrak{N} = \{0, 1\} = \{\text{female}, \text{male}\}$.

¹ Cumulative distribution functions $F$ are right-continuous, non-decreasing with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.

• A random variable $Y \sim F$ is said to be absolutely continuous² if there exists a
non-negative (measurable) function $f$, called density of $Y$, such that

$$F(y) = \int_{-\infty}^{y} f(x)\, dx \qquad \text{for all } y \in \mathbb{R}.$$

In that case we equivalently write $Y \sim f$ and $Y \sim F$. Absolutely continuous
random variables are often used to model claim sizes in insurance.


More generally speaking, discrete and absolutely continuous random variables
have densities $f(\cdot)$ w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$. In the former case, this
$\sigma$-finite measure $\nu$ is the counting measure on $\mathfrak{N} \subset \mathbb{R}$, and in the latter case it is
the Lebesgue measure on $\mathbb{R}$. In actuarial science we also consider mixed cases, for
instance, Tweedie’s compound Poisson random variable is absolutely continuous on
$(0, \infty)$ having an additional point mass in 0; this model will be studied in Sect. 2.2.3,
below.


² Absolutely continuous is a stronger property than continuous.

Choose a random variable $Y \sim F$ and a measurable function $h : \mathbb{R} \to \mathbb{R}$. The
expected value of $h(Y)$ is defined by (upon existence)

$$\mathbb{E}[h(Y)] = \int_{\mathbb{R}} h(y)\, dF(y).$$

We mainly focus on the following important examples of function $h$:

• expected value, mean or first moment of $Y \sim F$: for $h(y) = y$

$$\mu = \mathbb{E}[Y] = \int_{\mathbb{R}} y\, dF(y);$$

• $k$-th moment of $Y \sim F$ for $k \in \mathbb{N}$: for $h(y) = y^k$

$$\mathbb{E}[Y^k] = \int_{\mathbb{R}} y^k\, dF(y);$$

• moment generating function of $Y \sim F$ in $r \in \mathbb{R}$: for $h(y) = e^{ry}$

$$M_Y(r) = \mathbb{E}\left[e^{rY}\right] = \int_{\mathbb{R}} e^{ry}\, dF(y);$$

always subject to existence.
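
For a discrete random variable the integral w.r.t. $dF$ reduces to a sum over the probability weights, so the three choices of $h$ above can be evaluated directly. The following minimal sketch assumes a Poisson count variable with mean `lam` (an illustrative choice, not taken from the text) and truncates the infinite sum at a hypothetical cut-off `kmax`; the helper names are made up for this example.

```python
import math

def poisson_pmf(k, lam):
    """Probability weight f(k) = P[Y = k] of a Poisson(lam) count variable."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def expectation(h, lam, kmax=200):
    """E[h(Y)] as a sum over the probability weights, truncated at kmax."""
    return sum(h(k) * poisson_pmf(k, lam) for k in range(kmax + 1))

lam, r = 2.0, 0.5
mean = expectation(lambda y: y, lam)                 # h(y) = y
second = expectation(lambda y: y ** 2, lam)          # h(y) = y^2
mgf = expectation(lambda y: math.exp(r * y), lam)    # h(y) = exp(r*y)

# Known closed forms for the Poisson case, used here only as a cross-check
print(mean, lam)
print(second, lam + lam ** 2)
print(mgf, math.exp(lam * (math.exp(r) - 1.0)))
```

The truncation is harmless here because the Poisson weights decay faster than exponentially; for heavier-tailed count distributions the cut-off would need more care.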

The moment generating function $M_Y(\cdot)$ is sufficient for identifying distribution
functions of random variables $Y$. The following statements are elementary and their
proofs are based on Section 30 of Billingsley [34]; for more details we also refer to
Chapter 1 in the lecture notes [387]. Assume that the moment generating function
of $Y \sim F$ has a strictly positive radius of convergence $\rho_0 > 0$ around the origin
implying that $M_Y(r) < \infty$ for all $r \in (-\rho_0, \rho_0)$. In this case we can write $M_Y(r)$
as a power series expansion

$$M_Y(r) = \sum_{k=0}^{\infty} \frac{r^k}{k!}\, \mathbb{E}\left[Y^k\right] \qquad \text{for all } r \in (-\rho_0, \rho_0).$$

As a consequence we can differentiate $M_Y(\cdot)$ in the open interval $(-\rho_0, \rho_0)$
arbitrarily often, term by term under the sum. The derivatives in $r = 0$ provide
the $k$-th moments (which all exist and are finite)

$$\frac{d^k}{dr^k} M_Y(r)\Big|_{r=0} = \mathbb{E}\left[Y^k\right] \qquad \text{for all } k \in \mathbb{N}_0. \tag{1.1}$$

In particular, in this case we immediately know that all moments of $Y$ exist, and
these moments completely determine the moment generating function $M_Y$ of $Y$.
Another consequence is that for a random variable $Y$, whose moment generating
function $M_Y$ has a strictly positive radius of convergence around the origin, the
distribution function $F$ is fully determined by this moment generating function.
That is, if we have two such random variables $Y_1$ and $Y_2$ with $M_{Y_1}(r) = M_{Y_2}(r)$
for all $r \in (-r_0, r_0)$, for some $r_0 > 0$, then $Y_1 \overset{(d)}{=} Y_2$.³ Thus, these two
random variables have the same distribution function. This statement carries over
to the limit, i.e., if we have a sequence of random variables $(Y_n)_n$ whose moment
generating functions converge on a common interval $(-r_0, r_0)$, for some $r_0 > 0$,
to the moment generating function of $Y$, also being finite on $(-r_0, r_0)$, then $(Y_n)_n$
converges in distribution to $Y$; such an argument is used to prove the central limit
theorem (CLT).
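
Formula (1.1) and the power series expansion can be checked numerically. The sketch below assumes a gamma distribution with shape `shape` and rate `rate`, whose moment generating function $(\text{rate}/(\text{rate}-r))^{\text{shape}}$ has radius of convergence equal to `rate`; this distribution, the step size `h` and all function names are illustrative assumptions. Derivatives at the origin are approximated by central finite differences rather than computed symbolically.

```python
import math

def gamma_mgf(r, shape, rate):
    """M_Y(r) = (rate / (rate - r))**shape, which is finite only for r < rate."""
    if r >= rate:
        raise ValueError("moment generating function does not exist for r >= rate")
    return (rate / (rate - r)) ** shape

def gamma_moment(k, shape, rate):
    """E[Y^k] = Gamma(shape + k) / (Gamma(shape) * rate**k)."""
    return math.exp(math.lgamma(shape + k) - math.lgamma(shape)) / rate ** k

shape, rate, h = 2.0, 3.0, 1e-3
M = lambda r: gamma_mgf(r, shape, rate)

# Central finite differences approximate the derivatives of M_Y in r = 0
first = (M(h) - M(-h)) / (2.0 * h)
second = (M(h) - 2.0 * M(0.0) + M(-h)) / h ** 2
print(first, gamma_moment(1, shape, rate))
print(second, gamma_moment(2, shape, rate))

# Partial sum of the power series inside the radius of convergence (|r| < rate)
r = 1.0
series = sum(r ** k / math.factorial(k) * gamma_moment(k, shape, rate) for k in range(60))
print(series, M(r))
```

Evaluating `gamma_mgf` at `r >= rate` raises an error on purpose, which mirrors the fact that the expansion is only valid inside $(-\rho_0, \rho_0)$.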


³ The notation $Y_1 \overset{(d)}{=} Y_2$ is generally used for equality in distribution, meaning that $Y_1$ and $Y_2$ have
the same distribution function.

In insurance, we often deal with so-called positive random variables $Y$, meaning
that $Y \ge 0$, almost surely (a.s.). In that case, the statements about moment
generating functions and distributions hold true without the assumption of having a
positive radius of convergence around the origin, see Theorem 22.2 in Billingsley
[34]. Note that for positive random variables the moment generating function $M_Y(r)$
exists for all $r \le 0$.

Existence of the moment generating function $M_Y(r)$ for some positive $r > 0$
can also be interpreted as having a light-tailed distribution function. Observe that
if $M_Y(r)$ exists for some positive $r > 0$, then we can choose $s \in (0, r)$ and
Chebychev’s inequality gives us (we assume $Y \ge 0$, a.s., here)

$$\mathbb{P}[Y > y] = \mathbb{P}\left[\exp\{sY\} > \exp\{sy\}\right] \le \exp\{-sy\}\, M_Y(s). \tag{1.2}$$

The latter tells us that the survival function $1 - F(y) = \mathbb{P}[Y > y]$ decays
exponentially for $y \to \infty$. Heavy-tailed distribution functions do not have this
property; their survival function decays slower than exponentially as $y \to \infty$.
This slower decay of the survival function is the case for so-called subexponential
distribution functions (an example is the log-normal distribution; we refer to Rolski
et al. [320]) and for regularly varying survival functions (an example is the Pareto
distribution). Regularly varying survival functions $1 - F$ have the property

$$\lim_{y \to \infty} \frac{1 - F(ty)}{1 - F(y)} = t^{-\beta} \qquad \text{for all } t > 0 \text{ and some } \beta > 0. \tag{1.3}$$

These distribution functions have a polynomial tail (power tail) with tail index $\beta > 0$.
In particular, if a positively supported distribution function $F$ has a regularly
varying survival function with tail index $\beta > 0$, then this distribution function is
also subexponential, see Theorem 2.5.5 in Rolski et al. [320].
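
The contrast between the exponential bound (1.2) and the regular variation property (1.3) can be made concrete with two textbook distributions. The sketch assumes an exponential distribution with rate `rate` as the light-tailed example and a Lomax (Pareto type II) survival function $(1+y)^{-\beta}$ as the heavy-tailed example; both choices, the parameter values and the function names are illustrative assumptions.

```python
import math

def exp_survival(y, rate):
    """P[Y > y] for an exponential distribution with the given rate."""
    return math.exp(-rate * y)

def exp_mgf(s, rate):
    """Moment generating function rate / (rate - s), finite for s < rate."""
    return rate / (rate - s)

def tail_bound(y, s, rate):
    """Right-hand side of (1.2): exp(-s*y) * M_Y(s)."""
    return math.exp(-s * y) * exp_mgf(s, rate)

def lomax_survival(y, beta):
    """Survival function (1 + y)**(-beta), regularly varying with tail index beta."""
    return (1.0 + y) ** (-beta)

rate, s = 1.0, 0.5
checks = [(y, exp_survival(y, rate), tail_bound(y, s, rate)) for y in (1.0, 5.0, 20.0)]
for y, surv, bound in checks:
    print(f"y={y}: survival {surv:.3e} <= bound {bound:.3e}")

# Ratio in (1.3): approaches t**(-beta) for the Lomax tail, but 0 for the exponential tail
t, beta = 2.0, 1.5
ratio_heavy = lomax_survival(t * 1e8, beta) / lomax_survival(1e8, beta)
ratio_light = exp_survival(t * 50.0, rate) / exp_survival(50.0, rate)
print(ratio_heavy, t ** (-beta), ratio_light)
```

The exponential ratio collapses to zero for any $t > 1$, which is why a light-tailed survival function cannot satisfy (1.3) for a finite tail index.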

We are not going to specifically focus on heavy-tailed distribution functions, 
here, but we will explain how light-tailed random variables can be transformed to 
enjoy heavy-tailed properties. In these notes, we are mainly interested in studying 
different aspects of regression modeling. Regression modeling requires numerous 
observations to be able to successfully fit these models to the data. By definition, 
large claims are scarce, as they live in the tail of the distribution function and, thus, 
correspond to rare events. Therefore, it is often not possible to employ a regression 
model for scarce tail events. For this reason, extreme value analysis plays only
a marginal role in these notes, although it has a significant impact on insurance
prices. For more on extreme value theory we refer to the relevant literature, see,
e.g., Embrechts et al. [121], Rolski et al. [320], Mikosch [277] and Albrecher et 
al. [7]. 


1.3 Lab: Exploratory Data Analysis 


Our theory is going to be supported by several data examples. These examples are 
mostly based on publicly available data. The different data sets are described in 
detail in Chap. 13. We strongly recommend that the reader use these data sets to gain
her/his own modeling experience.

We describe some tools here that allow for a descriptive and exploratory analysis
of the available data; exploratory data analysis has been introduced and promoted by
Tukey [357]. We consider the observed claim sizes of the Swedish motorcycle data
set described in Sect. 13.2. This data set consists of 656 (positive) claim amounts $y_i$,
$1 \le i \le n = 656$. These claim amounts are illustrated in the boxplots of Fig. 1.1.

Typically in insurance, there are large claims that dominate the picture, see 
Fig. 1.1 (lhs). This results in right-skewed distribution functions, and such data is 
better illustrated on the log scale, see Fig. 1.1 (rhs). The latter, of course, assumes 
that all claims are strictly positive. 

Figure 1.2 (lhs) shows the empirical distribution function of the observations $y_i$,
$1 \le i \le n$, which is obtained by

$$\widehat{F}_n(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y_i \le y\}} \qquad \text{for } y \in \mathbb{R}.$$

If this data set has been generated by i.i.d. random variables, then the Glivenko–Cantelli
theorem [64, 159] tells us that this empirical distribution function $\widehat{F}_n$
converges uniformly to the (true) data generating distribution function, a.s., as the
number $n$ of observations converges to infinity, see Theorem 20.6 in Billingsley
[34].
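
A minimal sketch of the empirical distribution function follows. The five claim amounts are hypothetical and are not the Swedish motorcycle data; the second part simulates i.i.d. exponential observations (again an assumption for illustration) to show the uniform distance to the true distribution function shrinking as $n$ grows, in the spirit of the Glivenko–Cantelli theorem.

```python
import math
import random
from bisect import bisect_right

def empirical_cdf(observations):
    """Return the step function y -> (number of observations <= y) / n."""
    ys = sorted(observations)
    n = len(ys)
    def F_n(y):
        return bisect_right(ys, y) / n
    return F_n

claims = [1200.0, 350.0, 9800.0, 350.0, 27000.0]   # hypothetical claim amounts
F_n = empirical_cdf(claims)
print(F_n(349.0), F_n(350.0), F_n(10000.0))        # 0.0 0.4 0.8

def sup_distance(sample, true_cdf):
    """Uniform distance between the empirical and a continuous true distribution function."""
    ys = sorted(sample)
    n = len(ys)
    # the supremum is attained just before or at one of the jump points
    return max(max(abs((i + 1) / n - true_cdf(y)), abs(i / n - true_cdf(y)))
               for i, y in enumerate(ys))

random.seed(7)
true_cdf = lambda y: 1.0 - math.exp(-y)            # exponential with rate 1
d_small = sup_distance([random.expovariate(1.0) for _ in range(100)], true_cdf)
d_large = sup_distance([random.expovariate(1.0) for _ in range(10000)], true_cdf)
print(d_small, d_large)
```

Sorting once and using binary search keeps each evaluation of `F_n` at logarithmic cost, which matters when the function is evaluated on a fine grid for plotting.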

Fig. 1.1 Boxplot of the claim amounts of the Swedish motorcycle data set: (lhs) on the original
scale and (rhs) on the log scale

Fig. 1.2 (lhs) Empirical distribution and (rhs) empirical density of the observed claim amounts $y_i$,
$1 \le i \le n$

Figure 1.2 (rhs) shows the empirical density of the observations $y_i$, $1 \le i \le n$.
This empirical density is obtained by considering a kernel smoother of a given
bandwidth around each observation $y_i$. The standard choice is the Gaussian kernel,
with the bandwidth determining the variance parameter $\sigma^2 > 0$ of the Gaussian
density,

$$y \;\mapsto\; \widehat{f}_n(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2}\, \frac{(y - y_i)^2}{\sigma^2}\right\}.$$

From the graph in Fig. 1.2 (rhs) we observe that the main body of the claim sizes
is below an amount of 50’000, but the biggest claim exceeds 200’000. The latter
motivates studying heavy-tailedness of the claim size data. For this purpose, one usually
benchmarks with a distribution function $F$ that has a regularly varying survival
function with a tail index $\beta > 0$, see (1.3). Asymptotically a regularly varying
survival function behaves as $y^{-\beta}$; for this reason the log-log plot is a popular tool
to identify regularly varying tails. The log-log plot of a distribution function $F$ is
obtained by considering

$$y > 0 \;\mapsto\; \left(\log y,\ \log(1 - F(y))\right) \in \mathbb{R}^2.$$

Figure 1.3 gives the log-log plot of the empirical distribution function $\widehat{F}_n$. If this
plot looks asymptotically (for $y \to \infty$) like a straight line with a negative slope
$-\beta$, then the data shows heavy-tailedness in the sense of regular variation. Such
data cannot be modeled by a distribution function for which the moment generating
function $M_Y(r)$ exists for some positive $r > 0$, see (1.2). Figure 1.3 does not suggest
a regularly varying tail as we do not see an obvious asymptotic straight line for
increasing claim sizes.
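
To see what a regularly varying tail looks like in a log-log plot, the coordinates can be computed from simulated data. The sketch assumes a Pareto sample with threshold 1 and tail index `beta = 2`, generated by inversion; the sample, the least-squares slope as a crude summary, and all function names are assumptions for illustration and not the procedure behind Fig. 1.3.

```python
import math
import random

def log_log_points(observations):
    """Coordinates (log y, log(1 - F_n(y))) at the ordered observations."""
    ys = sorted(observations)
    n = len(ys)
    # the largest observation is dropped because 1 - F_n vanishes there
    return [(math.log(y), math.log(1.0 - (i + 1) / n)) for i, y in enumerate(ys[:-1])]

def least_squares_slope(points):
    """Slope of the ordinary least-squares line through the given points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxy = sum((x - mean_x) * (v - mean_v) for x, v in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

random.seed(3)
beta = 2.0
# Pareto sample by inversion: survival function y**(-beta) for y >= 1
pareto = [(1.0 - random.random()) ** (-1.0 / beta) for _ in range(5000)]
points = log_log_points(pareto)
slope = least_squares_slope(points)
print(f"fitted slope {slope:.2f}, compared with -beta = {-beta}")
```

For real claims data one would inspect only the upper part of the plot, since regular variation is an asymptotic statement; fitting a line through all points is reasonable here only because the simulated sample is exactly Pareto.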

Fig. 1.3 Log-log plot of claim amounts

These graphs give us a first indication of what the claim size data is about. Later
on we are going to introduce explanatory variables that describe the insurance
policyholders behind these claims. These explanatory variables characterize the
policyholder and the general goal is to get a better description of the claim sizes
as a function of these explanatory variables, e.g., older policyholders may cause
larger claims than younger ones, etc. Such patterns are called systematic effects that
can be explained by explanatory variables.


1.4 Outline of This Book 


This book has eleven chapters (including the present one), and it has two appendices. 
We briefly describe the contents of these chapters and appendices. 

In Chap. 2 we introduce and discuss the exponential family (EF) and the 
exponential dispersion family (EDF). The EF and the EDF are by far the most 
important classes of distribution functions for regression modeling. They include, 
among others, the Gaussian, the binomial, the Poisson, the gamma, the inverse 
Gaussian and Tweedie’s models. We introduce these families of distribution functions,
discuss their properties and provide several examples. Moreover, we introduce
the Kullback–Leibler (KL) divergence and the Bregman divergence, which are
important tools in model evaluation. 

Chapter 3 is on classical statistical decision theory. This chapter is important for 
historical reasons, but it also provides the right mathematical grounding and intu- 
ition for more modern tools from data science and machine learning. In particular, 
we discuss maximum likelihood estimation (MLE), unbiasedness, consistency and 
asymptotic normality of MLEs in this chapter. 

Chapter 4 is the core theoretical chapter on predictive modeling and forecast 
evaluation. The main problem in actuarial modeling is to forecast and price future 
claims. For this, we build predictive models, and this chapter deals with assessing 
and ranking these predictive models. We therefore introduce the mean squared
error of prediction (MSEP) and, more generally, the generalization loss (GL)
to assess predictive models. This chapter is complemented by a more decision-theoretic
approach to forecast evaluation; it discusses deviance losses, proper
scoring, elicitability, forecast dominance, cross-validation, Akaike’s information
criterion (AIC) and we give an introduction to the bootstrap simulation method.

Chapter 5 discusses state-of-the-art statistical modeling in insurance which is the 
generalized linear model (GLM). We discuss GLMs in the light of claim count and 
claim size modeling, we present feature engineering, model fitting, model selection, 
over-dispersion, zero-inflated claim counts problems, double GLMs, and insurance- 
specific issues such as the balance property for having unbiasedness. 

Chapter 6 summarizes some techniques that use Bayes’ theorem. These are 
classical Bayesian statistical models, e.g., using the Markov chain Monte Carlo 
(MCMC) method for model fitting. This chapter discusses regularization of regres- 
sion models such as ridge and LASSO regularization, which has a Bayesian 
interpretation, and it concerns the Expectation-Maximization (EM) algorithm. The 
EM algorithm is a general purpose tool that can handle incomplete data settings. We 
illustrate this for different examples coming from mixture distributions, censored 
and truncated claims data. 

The core of this book is formed by deep learning methods and neural networks. Chapter 7
considers deep feed-forward neural (FN) networks. We introduce the generic 
architecture of deep FN networks, and we discuss universality theorems of FN 
networks. We present network fitting, back-propagation, embedding layers for 
categorical variables and insurance-specific issues such as the balance property in 
network fitting and network ensembling to reduce model uncertainty. This chapter 
is complemented by many examples on non-life insurance pricing, but also on 
mortality modeling, as well as tools that help to explain deep FN network regression 
results. 

Chapters 8 and 9 consider recurrent neural (RN) networks and convolutional 
neural (CN) networks. These are special network architectures that are useful for 
time-series and spatial data modeling, e.g., applied to image recognition problems. 
Time-series and images have a natural topology, and RN and CN networks try to 
benefit from this additional structure (over tabular data). We introduce these network 
architectures and provide insurance-relevant examples. 

Chapter 10 discusses natural language processing (NLP) which deals with 
regression modeling of non-tabular or unstructured text data. We explain how 
words can be embedded into low-dimensional spaces that serve as numerical word
encodings. These can then be used for text recognition, either using RN networks or 
attention layers. We give an example where we aim at predicting claim perils from 
claim descriptions. 

Chapter 11 is a selection of different topics. We mention forecasting under 
model uncertainty, deep quantile regression, deep composite regression or the 
LocalGLMnet which is an interpretable FN network architecture. Moreover, we 
provide a bootstrap example to assess prediction uncertainty, and we discuss mixture 
density networks.

Chapter 12 (Appendix A) is a technical chapter that discusses universality theorems
for networks and sieve estimators, which are useful for studying asymptotic
normality within a network framework. Chapter 13 (Appendix B) illustrates the data 
used in this book. 

Finally, we remark that the book is written in a typical mathematical style 
using the structure of Lemmas, Theorems, etc. Results and statements which are 
particularly important for applications are highlighted with gray boxes.


Chapter 2
Exponential Dispersion Family


We introduce the exponential family (EF) and the exponential dispersion family 
(EDF) in this chapter. The single-parameter EF has been introduced in 1934 
by the British statistician Sir Fisher [128], and it has been extended to vector- 
valued parameters by Darmois [88], Koopman [223] and Pitman [306] between 
1935 and 1936. It is the most commonly used family of distribution functions 
in statistical modeling; among others, it contains the Gaussian distribution, the 
gamma distribution, the binomial distribution and the Poisson distribution. Its 
parametrization is taken in a special form that is convenient for statistical modeling. 
The EF can be introduced in a constructive way providing the main properties of 
this family of distribution functions. In this chapter we follow Jørgensen [201-203] 
and Barndorff-Nielsen [23], and we state the most important results based on this 
constructive introduction. This gives us a unified notation which is going to be useful 
for our purposes. 


2.1 Exponential Family 


2.1.1 Definition and Properties 


We define the EF w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$. The results in this section can be
generalized to $\sigma$-finite measures on $\mathbb{R}^m$, but such an extension is not necessary for
our purposes. Select an integer $k \in \mathbb{N}$, and choose measurable functions $a : \mathbb{R} \to \mathbb{R}$
and $T : \mathbb{R} \to \mathbb{R}^k$.¹ Consider for a canonical parameter $\boldsymbol{\theta} \in \mathbb{R}^k$ the Laplace
transform

$$\mathcal{L}(\boldsymbol{\theta}) = \int_{\mathbb{R}} \exp\left\{\boldsymbol{\theta}^\top T(y) + a(y)\right\} d\nu(y).$$

¹ We could also use boldface notation for $T$ because $T(y) \in \mathbb{R}^k$ is vector-valued, but we prefer to
not use boldface notation for (vector-valued) functions.


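Assume that this Laplace transform is not identically equal to $+\infty$. The effective
domain is defined by

$$\boldsymbol{\Theta} = \left\{\boldsymbol{\theta} \in \mathbb{R}^k;\ \mathcal{L}(\boldsymbol{\theta}) < \infty\right\} \subseteq \mathbb{R}^k. \tag{2.1}$$

Lemma 2.1 The effective domain $\boldsymbol{\Theta} \subseteq \mathbb{R}^k$ is a convex set.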


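The effective domain $\boldsymbol{\Theta}$ is not necessarily an open set, but in many applications it
is open. Counterexamples are given in Problem 4.1 of Chapter 1 in Lehmann [244],
and in the inverse Gaussian example in Sect. 2.1.3, below.

Proof of Lemma 2.1 Choose $\boldsymbol{\theta}_i \in \mathbb{R}^k$, $i = 1, 2$, with $\mathcal{L}(\boldsymbol{\theta}_i) < \infty$. Set
$\boldsymbol{\theta} = c\,\boldsymbol{\theta}_1 + (1 - c)\,\boldsymbol{\theta}_2$ for $c \in (0, 1)$. We use Hölder’s inequality, applied to the norms $p = 1/c$
and $q = 1/(1 - c)$,

$$\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) &= \int_{\mathbb{R}} \exp\left\{(c\,\boldsymbol{\theta}_1 + (1 - c)\,\boldsymbol{\theta}_2)^\top T(y) + a(y)\right\} d\nu(y) \\
&= \int_{\mathbb{R}} \exp\left\{\boldsymbol{\theta}_1^\top T(y) + a(y)\right\}^{c} \exp\left\{\boldsymbol{\theta}_2^\top T(y) + a(y)\right\}^{1-c} d\nu(y) \\
&\le \mathcal{L}(\boldsymbol{\theta}_1)^{c}\, \mathcal{L}(\boldsymbol{\theta}_2)^{1-c} < \infty.
\end{aligned}$$

This implies $\boldsymbol{\theta} \in \boldsymbol{\Theta}$ and proves the claim. □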


We define the cumulant function on the effective domain $\boldsymbol{\Theta}$

$$\kappa : \boldsymbol{\Theta} \to \mathbb{R}, \qquad \boldsymbol{\theta} \mapsto \kappa(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}).$$

Definition 2.2 The EF with $\sigma$-finite measure $\nu$ on $\mathbb{R}$ and cumulant function
$\kappa : \boldsymbol{\Theta} \to \mathbb{R}$ is given by the distribution functions $F$ on $\mathbb{R}$ with

$$dF(y; \boldsymbol{\theta}) = f(y; \boldsymbol{\theta})\, d\nu(y) = \exp\left\{\boldsymbol{\theta}^\top T(y) - \kappa(\boldsymbol{\theta}) + a(y)\right\} d\nu(y), \tag{2.2}$$

for canonical parameters $\boldsymbol{\theta} \in \boldsymbol{\Theta} \subseteq \mathbb{R}^k$.

Remarks 2.3

• The definition of the EF (2.2) assumes that the effective domain $\boldsymbol{\Theta} \subseteq \mathbb{R}^k$ has
been constructed from the choices $a : \mathbb{R} \to \mathbb{R}$ and $T : \mathbb{R} \to \mathbb{R}^k$ as described
in (2.1). This is not explicitly stated in the surrounding text of (2.2).

• The support of any random variable $Y \sim F(\cdot; \boldsymbol{\theta})$ of this EF does not depend on
the explicit choice of the canonical parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, but solely on the choice of
the $\sigma$-finite measure $\nu$ on $\mathbb{R}$, and the distribution functions $F(\cdot; \boldsymbol{\theta})$ are mutually
absolutely continuous (equivalent) w.r.t. $\nu$.

• In statistics, the main object of interest is the canonical parameter $\boldsymbol{\theta}$. Importantly
for parameter estimation, the function $a(\cdot)$ does not involve the canonical
parameter. Therefore, it is irrelevant for parameter estimation and (only) serves
as a normalization so that $F$ in (2.2) is a proper distribution function. In fact, this
is how the EF is often introduced in the statistical and actuarial literature,
but in this latter introduction we lose the deeper interpretation of the cumulant
function $\kappa$, nor is it immediately clear what properties it possesses.

• The case $k \ge 2$ gives a vector-valued canonical parameter $\boldsymbol{\theta}$. The case $k = 1$
gives a single-parameter EF, and, if additionally $T(y) = y$, it is called a single-parameter
linear EF.
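
The constructive route from a measure $\nu$ and a function $a$ via the Laplace transform to the density (2.2) can be followed numerically. As an illustration that is not taken from the text, the sketch assumes $\nu$ to be the counting measure on $\{0, \ldots, n\}$, $T(y) = y$ and $a(y) = \log \binom{n}{y}$; under these assumptions the construction should reproduce the binomial distribution with success probability $e^\theta/(1+e^\theta)$. The value `n = 5` and the function names are hypothetical.

```python
import math

n = 5   # hypothetical number of trials

def a(y):
    """Chosen function a(y) = log of the binomial coefficient; free of theta."""
    return math.log(math.comb(n, y))

def laplace(theta):
    """Laplace transform: integral w.r.t. the counting measure on {0, ..., n}."""
    return sum(math.exp(theta * y + a(y)) for y in range(n + 1))

def kappa(theta):
    """Cumulant function as the logarithm of the Laplace transform."""
    return math.log(laplace(theta))

def density(y, theta):
    """EF density (2.2) with T(y) = y."""
    return math.exp(theta * y - kappa(theta) + a(y))

theta = -0.4
p = math.exp(theta) / (1.0 + math.exp(theta))
ef = [density(y, theta) for y in range(n + 1)]
binom = [math.comb(n, y) * p ** y * (1.0 - p) ** (n - y) for y in range(n + 1)]

print(sum(ef))                                              # total mass of the density
print(kappa(theta), n * math.log(1.0 + math.exp(theta)))   # cumulant function, two ways
print(max(abs(u - v) for u, v in zip(ef, binom)))           # distance to the binomial weights
```

Here the Laplace transform is finite for every real `theta`, so the effective domain is the whole real line; with an unbounded support the sum could diverge for some `theta`, which is exactly what (2.1) guards against.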


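Theorem 2.4 Assume the effective domain $\boldsymbol{\Theta}$ has a non-empty interior $\mathring{\boldsymbol{\Theta}}$. Choose
$Y \sim F(\cdot; \boldsymbol{\theta})$ for fixed $\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$. The moment generating function of $T(Y)$ for
sufficiently small $\boldsymbol{r} \in \mathbb{R}^k$ is given by

$$M_{T(Y)}(\boldsymbol{r}) = \mathbb{E}_{\boldsymbol{\theta}}\left[\exp\left\{\boldsymbol{r}^\top T(Y)\right\}\right] = \exp\left\{\kappa(\boldsymbol{\theta} + \boldsymbol{r}) - \kappa(\boldsymbol{\theta})\right\},$$

where the expectation operator $\mathbb{E}_{\boldsymbol{\theta}}$ illustrates the selected canonical parameter $\boldsymbol{\theta}$
for $Y$.

Proof Choose $\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$ and $\boldsymbol{r} \in \mathbb{R}^k$ so small that $\boldsymbol{\theta} + \boldsymbol{r} \in \mathring{\boldsymbol{\Theta}}$. We obtain

$$\begin{aligned}
M_{T(Y)}(\boldsymbol{r}) &= \int_{\mathbb{R}} \exp\left\{(\boldsymbol{\theta} + \boldsymbol{r})^\top T(y) - \kappa(\boldsymbol{\theta}) + a(y)\right\} d\nu(y) \\
&= \exp\left\{\kappa(\boldsymbol{\theta} + \boldsymbol{r}) - \kappa(\boldsymbol{\theta})\right\} \int_{\mathbb{R}} \exp\left\{(\boldsymbol{\theta} + \boldsymbol{r})^\top T(y) - \kappa(\boldsymbol{\theta} + \boldsymbol{r}) + a(y)\right\} d\nu(y) \\
&= \exp\left\{\kappa(\boldsymbol{\theta} + \boldsymbol{r}) - \kappa(\boldsymbol{\theta})\right\},
\end{aligned}$$

where the last identity follows from the fact that the support of the EF does not
depend on the explicit choice of the canonical parameter. □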
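
The identity of Theorem 2.4 can be verified numerically in a one-dimensional example. The sketch assumes the Poisson case, i.e., counting measure on $\mathbb{N}_0$, $T(y) = y$, $a(y) = -\log(y!)$ and cumulant function $\kappa(\theta) = e^\theta$; this is a standard representation but is an assumption at this point of the text. The truncation point `ymax` and the function names are hypothetical.

```python
import math

def a(y):
    """a(y) = -log(y!) for the Poisson case."""
    return -math.lgamma(y + 1)

def kappa(theta):
    """Poisson cumulant function kappa(theta) = exp(theta)."""
    return math.exp(theta)

def mgf_direct(r, theta, ymax=300):
    """E_theta[exp(r*Y)] as a truncated sum over the EF density (2.2)."""
    return sum(math.exp((theta + r) * y - kappa(theta) + a(y))
               for y in range(ymax + 1))

def mgf_from_cumulant(r, theta):
    """Right-hand side of Theorem 2.4."""
    return math.exp(kappa(theta + r) - kappa(theta))

theta = 0.3
pairs = [(r, mgf_direct(r, theta), mgf_from_cumulant(r, theta))
         for r in (-1.0, -0.3, 0.2, 0.5)]
for r, direct, closed in pairs:
    print(f"r={r}: direct {direct:.10f}, via cumulant function {closed:.10f}")
```

In the direct sum the exponents are combined before exponentiating, which avoids multiplying a very large by a very small floating-point number for large `y`.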


Theorem 2.4 has a couple of immediate implications. First, in any interior point
$\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$ both the moment generating function $\boldsymbol{r} \mapsto M_{T(Y)}(\boldsymbol{r})$ (in the neighborhood of
the origin) and the cumulant function $\boldsymbol{\theta} \mapsto \kappa(\boldsymbol{\theta})$ have derivatives of all orders, and,
similarly to Sect. 1.2, moments of all orders of $T(Y)$ exist, see also (1.1). Existence
of moments of all orders implies that the distribution function of $T(Y)$ cannot have
a regularly varying tail.


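Corollary 2.5 Assume $\mathring{\boldsymbol{\Theta}}$ is non-empty. The cumulant function $\boldsymbol{\theta} \mapsto \kappa(\boldsymbol{\theta})$ is
convex, and for $Y \sim F(\cdot; \boldsymbol{\theta})$ with $\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$

$$\mu = \mathbb{E}_{\boldsymbol{\theta}}[T(Y)] = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) \qquad \text{and} \qquad \operatorname{Var}_{\boldsymbol{\theta}}(T(Y)) = \nabla^2_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}),$$

where $\nabla_{\boldsymbol{\theta}}$ is the gradient and $\nabla^2_{\boldsymbol{\theta}}$ the Hessian w.r.t. vector $\boldsymbol{\theta}$.

Similarly to $T : \mathbb{R} \to \mathbb{R}^k$, we will not use boldface notation for the (multi-dimensional)
mean because later on we will understand the mean $\mu = \mu(\boldsymbol{\theta}) \in \mathbb{R}^k$
as a function of the canonical parameter $\boldsymbol{\theta}$; see Footnote 1 of this chapter on boldface
notation.

Proof Existence of the moment generating function for all sufficiently small $\boldsymbol{r} \in \mathbb{R}^k$
(around the origin) implies that we have first and second moments. For the first
moment we obtain

$$\mu = \mathbb{E}_{\boldsymbol{\theta}}[T(Y)] = \nabla_{\boldsymbol{r}} M_{T(Y)}(\boldsymbol{r})\Big|_{\boldsymbol{r}=0} = \exp\{\kappa(\boldsymbol{\theta} + \boldsymbol{r}) - \kappa(\boldsymbol{\theta})\}\, \nabla_{\boldsymbol{r}} \kappa(\boldsymbol{\theta} + \boldsymbol{r})\Big|_{\boldsymbol{r}=0} = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}).$$

Denote component $j$ of $T(Y) \in \mathbb{R}^k$ by $T_j(Y)$. We have for $1 \le j, l \le k$

$$\begin{aligned}
\mathbb{E}_{\boldsymbol{\theta}}\left[T_j(Y)\, T_l(Y)\right] &= \frac{\partial^2}{\partial r_j\, \partial r_l} M_{T(Y)}(\boldsymbol{r})\Big|_{\boldsymbol{r}=0} \\
&= \exp\{\kappa(\boldsymbol{\theta} + \boldsymbol{r}) - \kappa(\boldsymbol{\theta})\} \left(\frac{\partial^2}{\partial r_j\, \partial r_l} \kappa(\boldsymbol{\theta} + \boldsymbol{r}) + \frac{\partial}{\partial r_j} \kappa(\boldsymbol{\theta} + \boldsymbol{r})\, \frac{\partial}{\partial r_l} \kappa(\boldsymbol{\theta} + \boldsymbol{r})\right)\Big|_{\boldsymbol{r}=0} \\
&= \frac{\partial^2}{\partial \theta_j\, \partial \theta_l} \kappa(\boldsymbol{\theta}) + \frac{\partial}{\partial \theta_j} \kappa(\boldsymbol{\theta})\, \frac{\partial}{\partial \theta_l} \kappa(\boldsymbol{\theta}).
\end{aligned}$$

This implies for the covariance

$$\operatorname{Cov}_{\boldsymbol{\theta}}\left(T_j(Y), T_l(Y)\right) = \frac{\partial^2}{\partial \theta_j\, \partial \theta_l} \kappa(\boldsymbol{\theta}).$$

The convexity of $\kappa$ follows because $\nabla^2_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta})$ is the positive semi-definite covariance
matrix of $T(Y)$, for all $\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$. This finishes the proof. □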
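
Corollary 2.5 does not depend on a named distribution, which the following sketch tries to illustrate. It assumes a single-parameter linear EF on the finite support $\{0, \ldots, 10\}$ with an arbitrary weight function `a(y) = -0.1 * y**2` that was made up for this example; the cumulant function is computed numerically, and its first and second derivatives are approximated by central finite differences with a hypothetical step size `h`.

```python
import math

support = range(0, 11)

def a(y):
    """Hypothetical function a(y), chosen only for illustration."""
    return -0.1 * y * y

def kappa(theta):
    """Cumulant function: log of the Laplace transform w.r.t. counting measure."""
    return math.log(sum(math.exp(theta * y + a(y)) for y in support))

def density(y, theta):
    """EF density (2.2) with T(y) = y."""
    return math.exp(theta * y - kappa(theta) + a(y))

theta, h = 0.5, 1e-4

# Moments computed directly from the density
mean = sum(y * density(y, theta) for y in support)
var = sum(y * y * density(y, theta) for y in support) - mean ** 2

# Derivatives of the cumulant function by central finite differences
kappa_1 = (kappa(theta + h) - kappa(theta - h)) / (2.0 * h)
kappa_2 = (kappa(theta + h) - 2.0 * kappa(theta) + kappa(theta - h)) / h ** 2

print(mean, kappa_1)
print(var, kappa_2)
```

The second derivative is strictly positive here because the support contains more than one point, in line with the convexity statement of the corollary.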


Assumption 2.6 (Minimal Representation) We assume that the interior $\mathring{\boldsymbol{\Theta}}$
of the effective domain $\boldsymbol{\Theta}$ is non-empty and that the cumulant function $\kappa$ is
strictly convex on this interior $\mathring{\boldsymbol{\Theta}}$.

Remarks 2.7

• Throughout these notes we will work under Assumption 2.6 without making
explicit reference. This assumption strengthens the properties of the cumulant
function $\kappa$ from being convex, see Corollary 2.5, to being strictly convex. This
strengthening implies that the mean function $\boldsymbol{\theta} \mapsto \mu = \mu(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta})$ can be
inverted; this is needed for the canonical link, see Definition 2.8, below.

• The strict convexity of $\kappa$ means that the covariance matrix $\nabla^2_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta})$ of $T(Y)$ is
positive definite and has full rank $k$ for all $\boldsymbol{\theta} \in \mathring{\boldsymbol{\Theta}}$, see Corollary 2.5. This property
is important; otherwise we do not have identifiability in the canonical parameter
$\boldsymbol{\theta}$ because we have a linear dependence between the components of $T(Y)$.

• Mathematically, this strict convexity is not a restriction because it can be obtained
by working under a so-called minimal representation. If the covariance matrix
$\nabla^2_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta})$ does not have full rank $k$, the choice $k$ is “non-optimal” because the
problem lives in a smaller dimension. Thus, w.l.o.g., we may and will assume to
work in this smaller dimension, called minimal representation; for a rigorous
derivation of a minimal representation we refer to Section 8.1 in Barndorff-Nielsen
[23].


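Definition 2.8 The canonical link is defined by $h = (\nabla_{\boldsymbol{\theta}} \kappa)^{-1}$.


The application of the canonical link $h$ to the mean implies under Assumption 2.6

$$
h(\boldsymbol{\mu}) = h\big(\mathbb{E}_{\boldsymbol{\theta}}[T(Y)]\big) = \boldsymbol{\theta},
$$

for mean $\boldsymbol{\mu} = \mathbb{E}_{\boldsymbol{\theta}}[T(Y)]$ of $Y \sim F(\cdot\,; \boldsymbol{\theta})$ with $\boldsymbol{\theta} \in \mathring{\Theta}$.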


Remarks 2.9 (Dual Parameter Space) Assumption 2.6 provides that the
canonical link $h$ is well-defined, and we can either work with the canonical
parameter representation $\boldsymbol{\theta} \in \mathring{\Theta} \subseteq \mathbb{R}^k$ or with its dual (mean) parameter
representation $\boldsymbol{\mu} = \mathbb{E}_{\boldsymbol{\theta}}[T(Y)] \in \mathcal{M}$ with

$$
\mathcal{M} \overset{\text{def.}}{=} \nabla_{\boldsymbol{\theta}} \kappa(\mathring{\Theta}) = \left\{ \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta});\ \boldsymbol{\theta} \in \mathring{\Theta} \right\} \subseteq \mathbb{R}^k. \tag{2.3}
$$

Strict convexity of $\kappa$ implies that there is a one-to-one correspondence
between these two parametrizations. $\Theta$ is called the effective domain and $\mathcal{M}$
is called the dual parameter space or the mean parameter space.


In Sect. 2.2.4, below, we introduce one more property called steepness that the
cumulant function $\kappa$ should satisfy. This additional property gives a relationship
between the support $\mathfrak{T}$ of the random variables $T(Y)$ of the given EF and the
boundary of the dual parameter space $\mathcal{M}$. This steepness property is important for
parameter estimation.
2.1.2 Single-Parameter Linear EF: Count Variable Examples


We start by giving single-parameter discrete linear EF examples based on counting
measures on $\mathbb{N}_0$. Since we work in one dimension $k = 1$, we replace boldface $\boldsymbol{\theta}$ by
scalar $\theta \in \Theta \subseteq \mathbb{R}$ in this section.


Bernoulli Distribution as a Single-Parameter Linear EF 


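For the Bernoulli distribution with parameter $p \in (0, 1)$ we choose as $\nu$ the counting
measure on $\{0, 1\}$. We make the following choices: $T(y) = y$,

$$
a(y) = 0, \qquad \kappa(\theta) = \log(1 + e^{\theta}), \qquad p = \kappa'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}, \qquad \theta = h(p) = \log\left(\frac{p}{1-p}\right),
$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = (0, 1)$ and support $\mathfrak{T} = \{0, 1\}$
of $Y = T(Y)$. With these choices we have

$$
dF(y; \theta) = \exp\left\{\theta y - \log(1 + e^{\theta})\right\} d\nu(y) = \left(\frac{e^{\theta}}{1 + e^{\theta}}\right)^{y} \left(\frac{1}{1 + e^{\theta}}\right)^{1-y} d\nu(y).
$$

$\theta \mapsto \kappa'(\theta)$ is the logistic or sigmoid function, and the canonical link $p \mapsto h(p)$ is
the logit function. Mean and variance are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = p \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \kappa''(\theta) = \frac{e^{\theta}}{(1 + e^{\theta})^2} = p(1 - p),
$$

and the probability weights satisfy for $y \in \mathfrak{T} = \{0, 1\}$

$$
\mathbb{P}_{\theta}[Y = y] = p^{y} (1 - p)^{1-y}.
$$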
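
The Bernoulli identities above can be checked numerically. The following is a minimal sketch, assuming only the formulas of this example; the function names and the finite-difference helper are illustrative choices and not part of the text.

```python
import math


def kappa(theta):
    """Bernoulli cumulant function kappa(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))


def mean(theta):
    """Logistic (sigmoid) function kappa'(theta) = e^theta / (1 + e^theta)."""
    return 1.0 / (1.0 + math.exp(-theta))


def canonical_link(p):
    """Logit link h(p) = log(p / (1 - p)), the inverse of kappa'."""
    return math.log(p / (1.0 - p))


def num_derivative(f, x, eps=1e-5):
    """Central finite difference, used only as a numerical sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


theta = 0.7                                   # arbitrary canonical parameter
p = mean(theta)
kappa_prime = num_derivative(kappa, theta)    # should be close to p
kappa_second = num_derivative(mean, theta)    # should be close to p * (1 - p)
theta_back = canonical_link(p)                # should recover theta
```

The first derivative of the cumulant function reproduces the mean, the second derivative reproduces the variance, and the logit link inverts the mean function.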


Binomial Distribution as a Single-Parameter Linear EF 


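For the binomial distribution with parameters $n \in \mathbb{N}$ and $p \in (0, 1)$ we choose as $\nu$
the counting measure on $\{0, \ldots, n\}$. We make the following choices: $T(y) = y$,

$$
a(y) = \log\binom{n}{y}, \qquad \kappa(\theta) = n \log(1 + e^{\theta}), \qquad \mu = \kappa'(\theta) = \frac{n e^{\theta}}{1 + e^{\theta}}, \qquad \theta = h(\mu) = \log\left(\frac{\mu}{n - \mu}\right),
$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = (0, n)$ and support $\mathfrak{T} = \{0, \ldots, n\}$
of $Y = T(Y)$. With these choices we have

$$
dF(y; \theta) = \binom{n}{y} \exp\left\{\theta y - n \log(1 + e^{\theta})\right\} d\nu(y) = \binom{n}{y} \left(\frac{e^{\theta}}{1 + e^{\theta}}\right)^{y} \left(\frac{1}{1 + e^{\theta}}\right)^{n-y} d\nu(y).
$$

Mean and variance are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = np \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \kappa''(\theta) = n \frac{e^{\theta}}{(1 + e^{\theta})^2} = np(1 - p),
$$

where we set $p = e^{\theta}/(1 + e^{\theta})$. The probability weights satisfy for $y \in \mathfrak{T} = \{0, \ldots, n\}$

$$
\mathbb{P}_{\theta}[Y = y] = \binom{n}{y} p^{y} (1 - p)^{n-y}.
$$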


Poisson Distribution as a Single-Parameter Linear EF 


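For the Poisson distribution with parameter $\lambda > 0$ we choose as $\nu$ the counting
measure on $\mathbb{N}_0$. We make the following choices: $T(y) = y$,

$$
a(y) = \log\left(\frac{1}{y!}\right), \qquad \kappa(\theta) = e^{\theta}, \qquad \mu = \kappa'(\theta) = e^{\theta}, \qquad \theta = h(\mu) = \log(\mu),
$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = (0, \infty)$ and support $\mathfrak{T} = \mathbb{N}_0$
of $Y = T(Y)$. With these choices we have

$$
dF(y; \theta) = \frac{1}{y!} \exp\left\{\theta y - e^{\theta}\right\} d\nu(y) = e^{-\mu} \frac{\mu^{y}}{y!}\, d\nu(y). \tag{2.4}
$$

The canonical link $\mu \mapsto h(\mu)$ is the log-link. Mean and variance are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = \lambda \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \kappa''(\theta) = \lambda = \mu = \mathbb{E}_{\theta}[Y],
$$

where we set $\lambda = e^{\theta}$. The probability weights in the Poisson case satisfy for $y \in \mathfrak{T} = \mathbb{N}_0$

$$
\mathbb{P}_{\theta}[Y = y] = e^{-\lambda} \frac{\lambda^{y}}{y!}.
$$
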
Negative-Binomial (Pólya) Distribution as a Single-Parameter Linear EF


For the negative-binomial distribution with $\alpha > 0$ and $p \in (0, 1)$ we choose as
$\nu$ the counting measure on $\mathbb{N}_0$; $\alpha$ plays the role of a nuisance parameter or hyper-parameter.
We make the following choices: $T(y) = y$,

$$
a(y) = \log\binom{y + \alpha - 1}{y}, \qquad \kappa(\theta) = -\alpha \log(1 - e^{\theta}),
$$

$$
\mu = \kappa'(\theta) = \alpha \frac{e^{\theta}}{1 - e^{\theta}}, \qquad \theta = h(\mu) = \log\left(\frac{\mu}{\mu + \alpha}\right),
$$

for effective domain $\Theta = (-\infty, 0)$, dual parameter space $\mathcal{M} = (0, \infty)$ and support
$\mathfrak{T} = \mathbb{N}_0$ of $Y = T(Y)$. With these choices we have

$$
dF(y; \theta) = \binom{y + \alpha - 1}{y} \exp\left\{\theta y + \alpha \log(1 - e^{\theta})\right\} d\nu(y)
= \binom{y + \alpha - 1}{y} p^{y} (1 - p)^{\alpha}\, d\nu(y),
$$

with $p = e^{\theta}$. Parameter $\alpha > 0$ is treated as nuisance parameter, otherwise we drop
out of the EF framework. We have the first two moments

$$
\mu = \mathbb{E}_{\theta}[Y] = \alpha \frac{e^{\theta}}{1 - e^{\theta}} = \alpha \frac{p}{1 - p} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \mathbb{E}_{\theta}[Y] \left(1 + \frac{e^{\theta}}{1 - e^{\theta}}\right) > \mathbb{E}_{\theta}[Y].
$$

This model allows us to model over-dispersion, in contrast to the Poisson model.
In fact, the negative-binomial model is a mixed Poisson model with a gamma
mixing distribution, for details see Sect. 5.3.5, below. Typically, one uses a different
parametrization. Set $e^{\theta} = \lambda/(\alpha + \lambda)$, for $\lambda > 0$. This implies

$$
\mu = \mathbb{E}_{\theta}[Y] = \lambda \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \lambda \left(1 + \frac{\lambda}{\alpha}\right) > \lambda.
$$

For $\alpha \in \mathbb{N}$ this model can also be interpreted as the waiting time until we observe
$\alpha$ successful trials among i.i.d. trials, for instance, for $\alpha = 1$ we have the geometric
distribution (with a small reparametrization).

The probability weights of the negative-binomial model satisfy for $y \in \mathfrak{T} = \mathbb{N}_0$

$$
\mathbb{P}_{\theta}[Y = y] = \binom{y + \alpha - 1}{y} p^{y} (1 - p)^{\alpha}. \tag{2.5}
$$
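
The over-dispersion of the negative-binomial model can be illustrated numerically. The sketch below assumes the probability weights (2.5) with $p = e^{\theta}$ and the reparametrization $e^{\theta} = \lambda/(\alpha + \lambda)$; the parameter values and the truncation point of the sum are arbitrary illustrative choices.

```python
import math


def negbin_pmf(y, alpha, theta):
    """P[Y = y] = C(y + alpha - 1, y) p^y (1 - p)^alpha with p = e^theta, theta < 0."""
    p = math.exp(theta)
    # Generalized binomial coefficient via log-gamma: Gamma(y + alpha) / (Gamma(alpha) y!)
    log_binom = math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
    return math.exp(log_binom + y * math.log(p) + alpha * math.log1p(-p))


alpha, lam = 2.5, 1.2                        # illustrative values
theta = math.log(lam / (alpha + lam))        # reparametrization e^theta = lam / (alpha + lam)

ys = range(400)                              # truncation; the remaining tail mass is negligible here
weights = [negbin_pmf(y, alpha, theta) for y in ys]
total = sum(weights)
mean = sum(y * w for y, w in zip(ys, weights))
var = sum((y - mean) ** 2 * w for y, w in zip(ys, weights))
```

Under these assumptions the weights sum to one, the mean equals $\lambda$, and the variance equals $\lambda(1 + \lambda/\alpha)$, which exceeds the Poisson variance $\lambda$.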


2.1.3 Vector-Valued Parameter EF: Absolutely Continuous Examples


In this section we give absolutely continuous EF examples with a vector-valued parameter, $k = 2$,
based on the Lebesgue measure on (subsets of) $\mathbb{R}$.
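Gaussian Distribution as a Vector-Valued Parameter EF


For the Gaussian distribution with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ we choose as $\nu$
the Lebesgue measure on $\mathbb{R}$, and we make the following choices: $T(y) = (y, y^2)^\top$,

$$
a(y) = -\frac{1}{2} \log(2\pi), \qquad \kappa(\boldsymbol{\theta}) = -\frac{\theta_1^2}{4\theta_2} - \frac{1}{2} \log(-2\theta_2),
$$

$$
(\mu, \sigma^2 + \mu^2)^\top = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \left( \frac{\theta_1}{-2\theta_2},\ (-2\theta_2)^{-1} + \frac{\theta_1^2}{4\theta_2^2} \right)^\top,
$$

for effective domain $\Theta = \mathbb{R} \times (-\infty, 0)$, dual parameter space $\mathcal{M} = \mathbb{R} \times (0, \infty)$
and support $\mathfrak{T} = \mathbb{R} \times [0, \infty)$ of $T(Y) = (Y, Y^2)^\top$. With these choices we have

$$
\begin{aligned}
dF(y; \boldsymbol{\theta}) &= \exp\left\{ \boldsymbol{\theta}^\top T(y) + \frac{\theta_1^2}{4\theta_2} + \frac{1}{2} \log(-2\theta_2) - \frac{1}{2} \log(2\pi) \right\} d\nu(y) \\
&= \frac{1}{\sqrt{2\pi}\, (-2\theta_2)^{-1/2}} \exp\left\{ -\frac{1}{2} \frac{1}{(-2\theta_2)^{-1}} \left( y - \frac{\theta_1}{-2\theta_2} \right)^2 \right\} d\nu(y).
\end{aligned}
$$

This is the Gaussian model with mean $\mu = \theta_1/(-2\theta_2)$ and variance $\sigma^2 = (-2\theta_2)^{-1}$.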
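
The change between the usual Gaussian parameters $(\mu, \sigma^2)$ and the canonical parameter $\boldsymbol{\theta} = (\theta_1, \theta_2)^\top$ can be sketched as follows. This is a minimal illustration of the identities $\mu = \theta_1/(-2\theta_2)$ and $\sigma^2 = (-2\theta_2)^{-1}$ stated above; the function names are hypothetical.

```python
def to_canonical(mu, sigma2):
    """Map (mu, sigma^2) to the canonical parameter (theta_1, theta_2)."""
    if sigma2 <= 0.0:
        raise ValueError("sigma2 must be positive")
    theta1 = mu / sigma2              # from mu = theta_1 / (-2 theta_2)
    theta2 = -1.0 / (2.0 * sigma2)    # from sigma^2 = (-2 theta_2)^(-1)
    return theta1, theta2


def to_mean_variance(theta1, theta2):
    """Map the canonical parameter back to (mu, sigma^2); requires theta_2 < 0."""
    if theta2 >= 0.0:
        raise ValueError("theta2 must be negative (effective domain R x (-inf, 0))")
    return theta1 / (-2.0 * theta2), 1.0 / (-2.0 * theta2)


def dual_parameter(theta1, theta2):
    """Gradient of the cumulant function: (E[Y], E[Y^2]) = (mu, sigma^2 + mu^2)."""
    mu, sigma2 = to_mean_variance(theta1, theta2)
    return mu, sigma2 + mu ** 2


theta1, theta2 = to_canonical(mu=1.5, sigma2=4.0)
mu_back, sigma2_back = to_mean_variance(theta1, theta2)
first_moment, second_moment = dual_parameter(theta1, theta2)
```

The round trip recovers $(\mu, \sigma^2)$, and the dual parameter is the pair of first and second moments of $Y$.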


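If we treat $\sigma > 0$ as a nuisance parameter, we obtain the Gaussian model as a
single-parameter EF. This is the most common example of an EF. Set $T(y) = y/\sigma$
and

$$
a(y) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{y^2}{2\sigma^2}, \qquad \kappa(\theta) = \theta^2/2, \qquad \mu = \kappa'(\theta) = \theta, \qquad \theta = h(\mu) = \mu,
$$

for effective domain $\Theta = \mathbb{R}$, dual parameter space $\mathcal{M} = \mathbb{R}$ and support $\mathfrak{T} = \mathbb{R}$ of
$T(Y) = Y/\sigma$. With these choices we have

$$
\begin{aligned}
dF(y; \theta) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \theta y/\sigma - y^2/(2\sigma^2) - \theta^2/2 \right\} d\nu(y) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (y - \sigma\theta)^2 \right\} d\nu(y),
\end{aligned}
$$

and, in particular, the canonical link is the identity link $\mu \mapsto \theta = h(\mu) = \mu$ in this
single-parameter EF example.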
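Gamma Distribution as a Vector-Valued Parameter EF


For the gamma distribution with parameters $\alpha, \beta > 0$ we choose as $\nu$ the Lebesgue
measure on $\mathbb{R}_+$. Then we make the following choices: $T(y) = (y, \log y)^\top$,

$$
a(y) = -\log y, \qquad \kappa(\boldsymbol{\theta}) = \log\Gamma(\theta_2) - \theta_2 \log(-\theta_1),
$$

$$
\left( \frac{\alpha}{\beta},\ \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - \log\beta \right)^\top = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \left( \frac{\theta_2}{-\theta_1},\ \frac{\Gamma'(\theta_2)}{\Gamma(\theta_2)} - \log(-\theta_1) \right)^\top,
$$

for effective domain $\Theta = (-\infty, 0) \times (0, \infty)$, and setting $\beta = -\theta_1 > 0$ and
$\alpha = \theta_2 > 0$. The dual parameter space is $\mathcal{M} = (0, \infty) \times \mathbb{R}$, and we have support
$\mathfrak{T} = (0, \infty) \times \mathbb{R}$ of $T(Y) = (Y, \log Y)^\top$. With these choices we obtain

$$
\begin{aligned}
dF(y; \boldsymbol{\theta}) &= \exp\left\{ \boldsymbol{\theta}^\top T(y) - \log\Gamma(\theta_2) + \theta_2 \log(-\theta_1) - \log y \right\} d\nu(y) \\
&= \frac{(-\theta_1)^{\theta_2}}{\Gamma(\theta_2)}\, y^{\theta_2 - 1} \exp\left\{ -(-\theta_1) y \right\} d\nu(y) \\
&= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp\{-\beta y\}\, d\nu(y).
\end{aligned}
$$

This is a vector-valued parameter EF with $k = 2$, and the first moment is given by

$$
\mathbb{E}_{\boldsymbol{\theta}}\left[ (Y, \log Y)^\top \right] = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \left( \frac{\alpha}{\beta},\ \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - \log\beta \right)^\top.
$$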


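Parameter $\alpha$ is called shape parameter and parameter $\beta$ is called scale parameter.²
If we treat the shape parameter $\alpha > 0$ as a nuisance parameter we can turn the
gamma distribution into a single-parameter linear EF. Set $T(y) = y$ and

$$
a(y) = (\alpha - 1) \log y - \log\Gamma(\alpha), \qquad \kappa(\theta) = -\alpha \log(-\theta), \qquad \mu = \kappa'(\theta) = \frac{\alpha}{-\theta}, \qquad \theta = h(\mu) = -\frac{\alpha}{\mu},
$$

for effective domain $\Theta = (-\infty, 0)$, dual parameter space $\mathcal{M} = (0, \infty)$ and support
$\mathfrak{T} = (0, \infty)$. With these choices we have for $\beta = -\theta > 0$

$$
dF(y; \theta) = \frac{(-\theta)^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp\{-(-\theta) y\}\, d\nu(y). \tag{2.6}
$$

This provides us with mean and variance

$$
\mu = \mathbb{E}_{\theta}[Y] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \sigma^2 = \mathrm{Var}_{\theta}(Y) = \frac{\alpha}{\beta^2} = \frac{\mu^2}{\alpha}.
$$

² The function $\Psi(x) = \frac{d}{dx} \log\Gamma(x) = \Gamma'(x)/\Gamma(x)$ is called digamma function.

For parameter estimation one often needs to invert these identities which gives us

$$
\alpha = \frac{\mu^2}{\sigma^2} \qquad \text{and} \qquad \beta = \frac{\mu}{\sigma^2}.
$$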
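
The inversion of the gamma moment identities can be written as a small helper. This is a minimal sketch, assuming the parametrization of this example with mean $\alpha/\beta$ and variance $\alpha/\beta^2$; the function names are illustrative.

```python
def gamma_moments(alpha, beta):
    """Mean and variance of the gamma model (2.6) with beta = -theta."""
    return alpha / beta, alpha / beta ** 2


def gamma_from_moments(mu, sigma2):
    """Invert the moment identities: alpha = mu^2 / sigma^2, beta = mu / sigma^2."""
    if mu <= 0.0 or sigma2 <= 0.0:
        raise ValueError("mu and sigma2 must be positive")
    return mu ** 2 / sigma2, mu / sigma2


alpha, beta = gamma_from_moments(mu=2.0, sigma2=0.5)
mu_back, sigma2_back = gamma_moments(alpha, beta)
```

For the illustrative input $\mu = 2$ and $\sigma^2 = 0.5$ this gives $\alpha = 8$ and $\beta = 4$, and the forward map recovers the input moments.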


Remarks 2.10


• The gamma distribution contains as special cases the exponential distribution for
$\alpha = \theta_2 = 1$ and $\beta = -\theta_1 > 0$, and the $\chi^2$-distribution with $r$ degrees of freedom
for $\alpha = \theta_2 = r/2$ and $\beta = -\theta_1 = 1/2$.

• The distributions of the EF are all light-tailed in the sense that all moments
of $T(Y)$ exist. Therefore, the EF does not allow for regularly varying survival
functions, see (1.3). If $Y$ is gamma distributed, then $Z = \exp\{Y\}$ is log-gamma
distributed (with the special case of the Pareto distribution for the exponential
case $\alpha = \theta_2 = 1$). For an example we refer to Sect. 2.2.5. However, this log-transformation
is not always recommended: it may provide an accurate
model on the transformed log-scale, but back-transformation to the original
scale does not necessarily provide a good predictive model on that original scale.

• The gamma density (2.6) may be a bit tricky in applications because the effective
domain $\Theta = (-\infty, 0)$ is one-sided bounded (we come back to this below). For
this reason, in practice, one often uses links different from the canonical link
$h(\mu) = -\alpha/\mu$. For instance, a parametrization $\theta = -\exp\{-\vartheta\}$ for $\vartheta \in \mathbb{R}$, see
Ohlsson–Johansson [290], leads to the following model

$$
dF(y; \vartheta) = \frac{y^{\alpha - 1}}{\Gamma(\alpha)} \exp\left\{ -e^{-\vartheta} y - \alpha\vartheta \right\} d\nu(y). \tag{2.7}
$$

We will study the gamma model in more depth below, and parametrization (2.7)
will correspond to the log-link choice, see Example 5.5, below.


Figure 2.1 gives examples of gamma densities for shape parameters $\alpha \in \{1/2, 1, 3/2, 2\}$
and scale parameters $\beta \in \{1/2, 1, 3/2, 2\}$ with $\alpha = \beta$, all providing
the same mean $\mu = \mathbb{E}_{\theta}[Y] = \alpha/\beta = 1$. The crucial observation is that these gamma
densities can have two different shapes: for $\alpha \le 1$ we have a strictly decreasing
shape and for $\alpha > 1$ we have a unimodal density with mode in $(\alpha - 1)/\beta$.


Inverse Gaussian Distribution as a Vector-Valued Parameter EF 


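For the inverse Gaussian distribution with parameters $\alpha, \beta > 0$ we choose as $\nu$ the
Lebesgue measure on $\mathbb{R}_+$. Then we make the following choices: $T(y) = (y, 1/y)^\top$,

$$
a(y) = -\frac{1}{2} \log(2\pi y^3), \qquad \kappa(\boldsymbol{\theta}) = -2(\theta_1\theta_2)^{1/2} - \frac{1}{2} \log(-2\theta_2),
$$

$$
\left( \frac{\alpha}{\beta},\ \frac{\beta}{\alpha} + \frac{1}{\alpha^2} \right)^\top = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \left( \left( \frac{-2\theta_2}{-2\theta_1} \right)^{1/2},\ \left( \frac{-2\theta_1}{-2\theta_2} \right)^{1/2} + \frac{1}{-2\theta_2} \right)^\top,
$$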
Fig. 2.1 Gamma densities for shape parameters $\alpha \in \{1/2, 1, 3/2, 2\}$ and scale parameters $\beta \in \{1/2, 1, 3/2, 2\}$, all providing the same mean $\mu = \alpha/\beta = 1$


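for $\boldsymbol{\theta} = (\theta_1, \theta_2)^\top \in (-\infty, 0)^2$, and setting $\beta = (-2\theta_1)^{1/2}$ and $\alpha = (-2\theta_2)^{1/2}$.
The dual parameter space is $\mathcal{M} = (0, \infty)^2$, and we have support $\mathfrak{T} = (0, \infty)^2$ of
$T(Y) = (Y, 1/Y)^\top$. With these choices we obtain

$$
\begin{aligned}
dF(y; \boldsymbol{\theta}) &= \exp\left\{ \boldsymbol{\theta}^\top T(y) + 2(\theta_1\theta_2)^{1/2} + \frac{1}{2} \log(-2\theta_2) - \frac{1}{2} \log(2\pi y^3) \right\} d\nu(y) \\
&= \frac{(-2\theta_2)^{1/2}}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{1}{2y} \left( (-2\theta_1) y^2 + (-2\theta_2) - 4(\theta_1\theta_2)^{1/2} y \right) \right\} d\nu(y) \\
&= \frac{\alpha}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{\alpha^2}{2y} \left( 1 - \frac{\beta}{\alpha} y \right)^2 \right\} d\nu(y). \tag{2.8}
\end{aligned}
$$

This is a vector-valued parameter EF with $k = 2$ and with first moment

$$
\mathbb{E}_{\boldsymbol{\theta}}\left[ (Y, 1/Y)^\top \right] = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \left( \frac{\alpha}{\beta},\ \frac{\beta}{\alpha} + \frac{1}{\alpha^2} \right)^\top.
$$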


To obtain (2.8) we have chosen canonical parameter $\boldsymbol{\theta} = (\theta_1, \theta_2)^\top \in (-\infty, 0)^2$.
Interestingly, we can close this parameter space for $\theta_1 = 0$, i.e., the effective domain
$\Theta$ is not open in this example. The choice $\theta_1 = 0$ gives us cumulant function
$\kappa(\boldsymbol{\theta}) = -\frac{1}{2} \log(-2\theta_2)$ and boundary case

$$
\begin{aligned}
dF(y; \boldsymbol{\theta}) &= \exp\left\{ \boldsymbol{\theta}^\top T(y) + \frac{1}{2} \log(-2\theta_2) - \frac{1}{2} \log(2\pi y^3) \right\} d\nu(y) \\
&= \frac{(-2\theta_2)^{1/2}}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{-2\theta_2}{2y} \right\} d\nu(y) \\
&= \frac{\alpha}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{\alpha^2}{2y} \right\} d\nu(y). \tag{2.9}
\end{aligned}
$$

This is the distribution of the first-passage time of level $\alpha > 0$ of a standard
Brownian motion, see Bachelier [20]; this distribution is also known as Lévy
distribution.

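If we treat $\alpha > 0$ as a nuisance parameter, we can turn the inverse Gaussian
distribution into a single-parameter linear EF by setting $T(y) = y$,

$$
a(y) = \log\left( \frac{\alpha}{(2\pi y^3)^{1/2}} \right) - \frac{\alpha^2}{2y}, \qquad \kappa(\theta) = -\alpha(-2\theta)^{1/2},
$$

$$
\mu = \kappa'(\theta) = \frac{\alpha}{(-2\theta)^{1/2}}, \qquad \theta = h(\mu) = -\frac{1}{2} \frac{\alpha^2}{\mu^2},
$$

for $\theta \in (-\infty, 0)$, dual parameter space $\mathcal{M} = (0, \infty)$ and support $\mathfrak{T} = (0, \infty)$. With
these choices we have the inverse Gaussian model for $\beta = (-2\theta)^{1/2} > 0$

$$
\begin{aligned}
dF(y; \theta) &= \frac{\alpha}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{1}{2y} \left( (-2\theta) y^2 + \alpha^2 - 2\alpha(-2\theta)^{1/2} y \right) \right\} d\nu(y) \\
&= \frac{\alpha}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{\alpha^2}{2y} \left( 1 - \frac{\beta}{\alpha} y \right)^2 \right\} d\nu(y).
\end{aligned}
$$

This provides us with mean and variance

$$
\mu = \mathbb{E}_{\theta}[Y] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \sigma^2 = \mathrm{Var}_{\theta}(Y) = \frac{\alpha}{\beta^3} = \frac{1}{\alpha^2} \mu^3.
$$

For parameter estimation one often needs to invert these identities, which gives us

$$
\alpha = \frac{\mu^{3/2}}{\sigma} \qquad \text{and} \qquad \beta = \frac{\mu^{1/2}}{\sigma}.
$$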
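
As a numerical cross-check of the inverse Gaussian moment identities, the following sketch integrates the density (2.8) with a simple midpoint rule. It assumes the parametrization of this example with mean $\alpha/\beta$ and variance $\alpha/\beta^3$; the integration bounds, step count and function names are illustrative choices.

```python
import math


def inverse_gaussian_from_moments(mu, sigma2):
    """Invert the moment identities: alpha = mu^(3/2) / sigma, beta = mu^(1/2) / sigma."""
    sigma = math.sqrt(sigma2)
    return mu ** 1.5 / sigma, math.sqrt(mu) / sigma


def inverse_gaussian_density(y, alpha, beta):
    """Density (2.8): alpha / sqrt(2 pi y^3) * exp(-alpha^2 / (2y) * (1 - beta y / alpha)^2)."""
    return alpha / math.sqrt(2.0 * math.pi * y ** 3) * math.exp(
        -alpha ** 2 / (2.0 * y) * (1.0 - beta / alpha * y) ** 2
    )


def midpoint_moments(alpha, beta, upper=40.0, steps=100_000):
    """Total mass, mean and variance by a midpoint rule on (0, upper]."""
    h = upper / steps
    total = first = second = 0.0
    for i in range(steps):
        y = (i + 0.5) * h                      # midpoints avoid the singular point y = 0
        w = inverse_gaussian_density(y, alpha, beta) * h
        total += w
        first += y * w
        second += y * y * w
    return total, first, second - first ** 2


alpha, beta = inverse_gaussian_from_moments(mu=1.0, sigma2=0.5)
total, mean, var = midpoint_moments(alpha, beta)
```

With the illustrative targets $\mu = 1$ and $\sigma^2 = 0.5$, the numerical integral should reproduce total mass one and the two target moments up to discretization error.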


Figure 2.2 gives examples of inverse Gaussian densities for parameter choices
$\alpha = \beta \in \{1/2, 1, 3/2, 2\}$ all providing the same mean $\mu = \mathbb{E}_{\theta}[Y] = \alpha/\beta = 1$.


Generalized Inverse Gaussian Distribution as a Vector-Valued Parameter EF


For the generalized inverse Gaussian distribution with parameters $\alpha, \beta > 0$ and
$\gamma \in \mathbb{R}$ we choose as $\nu$ the Lebesgue measure on $\mathbb{R}_+$. We combine the terms of
the gamma and the inverse Gaussian models to the vector-valued choice:
$T(y) = (y, \log y, 1/y)^\top$ with $k = 3$. Moreover, we choose $a(y) = -\log y$ and cumulant
function

$$
\kappa(\boldsymbol{\theta}) = \log\left( 2 K_{\theta_2}\big( 2\sqrt{\theta_1\theta_3} \big) \right) - \frac{\theta_2}{2} \log(\theta_1/\theta_3),
$$



Fig. 2.2 Inverse Gaussian densities for parameters $\alpha = \beta \in \{1/2, 1, 3/2, 2\}$, all providing the same mean $\mu = \alpha/\beta = 1$


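for $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)^\top \in (-\infty, 0) \times \mathbb{R} \times (-\infty, 0)$, and where $K_{\theta_2}$ denotes the
modified Bessel function of the second kind with index $\gamma = \theta_2 \in \mathbb{R}$. With these
choices we obtain the generalized inverse Gaussian density

$$
\begin{aligned}
dF(y; \boldsymbol{\theta}) &= \exp\left\{ \boldsymbol{\theta}^\top T(y) - \log\left( 2 K_{\theta_2}\big( 2\sqrt{\theta_1\theta_3} \big) \right) + \frac{\theta_2}{2} \log(\theta_1/\theta_3) - \log y \right\} d\nu(y) \\
&= \frac{(\alpha/\beta)^{\gamma/2}}{2 K_{\gamma}(\sqrt{\alpha\beta})}\, y^{\gamma - 1} \exp\left\{ -\frac{1}{2} \left( \alpha y + \beta y^{-1} \right) \right\} d\nu(y), \tag{2.10}
\end{aligned}
$$

setting $\alpha = -2\theta_1$ and $\beta = -2\theta_3$. This is a vector-valued parameter EF with $k = 3$,
and the first moment is given by

$$
\begin{aligned}
\mathbb{E}_{\boldsymbol{\theta}}\left[ \left( Y, \log Y, \frac{1}{Y} \right)^\top \right] &= \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) \\
&= \left( \frac{K_{\gamma+1}(\sqrt{\alpha\beta})}{K_{\gamma}(\sqrt{\alpha\beta})} \sqrt{\frac{\beta}{\alpha}},\ \
\log\sqrt{\frac{\beta}{\alpha}} + \frac{\partial}{\partial\gamma} \log K_{\gamma}(\sqrt{\alpha\beta}),\ \
\frac{K_{\gamma+1}(\sqrt{\alpha\beta})}{K_{\gamma}(\sqrt{\alpha\beta})} \sqrt{\frac{\alpha}{\beta}} - \frac{2\gamma}{\beta} \right)^\top.
\end{aligned}
$$

The effective domain $\Theta$ is a bit complicated because the possible choices of $(\theta_1, \theta_3)$
depend on $\theta_2 \in \mathbb{R}$, namely, for $\theta_2 < 0$ the negative half-line $(-\infty, 0]$ can be closed
at the origin for $\theta_1$, and for $\theta_2 > 0$ it can be closed at the origin for $\theta_3$. The inverse
Gaussian model is obtained for $\theta_2 = -1/2$ and the gamma model is obtained for
$\theta_3 = 0$. For further properties of the generalized inverse Gaussian distribution we
refer to the textbook of Jørgensen [200].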
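2.1.4 Vector-Valued Parameter EF: Count Variable Example


We close our EF examples by giving a discrete example with a vector-valued
parameter.


Categorical Distribution as a Vector-Valued Parameter EF


For the categorical distribution with $k \in \mathbb{N}$ and $\boldsymbol{p} \in (0, 1)^k$ such that $\sum_{i=1}^{k} p_i < 1$,
we choose as $\nu$ the counting measure on the finite set $\{1, \ldots, k + 1\}$. Then we make
the following choices: $T(y) = (\mathbb{1}_{\{y=1\}}, \ldots, \mathbb{1}_{\{y=k\}})^\top \in \mathbb{R}^k$, $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)^\top$,
$e^{\boldsymbol{\theta}} = (e^{\theta_1}, \ldots, e^{\theta_k})^\top$ and

$$
a(y) = 0, \qquad \kappa(\boldsymbol{\theta}) = \log\left( 1 + \sum_{i=1}^{k} e^{\theta_i} \right), \qquad \boldsymbol{p} = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \frac{e^{\boldsymbol{\theta}}}{1 + \sum_{i=1}^{k} e^{\theta_i}},
$$

for effective domain $\Theta = \mathbb{R}^k$, dual parameter space $\mathcal{M} = (0, 1)^k$, and the support
$\mathfrak{T}$ of $T(Y)$ are the $k + 1$ corners of the unit simplex in $\mathbb{R}^k$. This representation is
minimal, see Assumption 2.6. With these choices we have (set $\theta_{k+1} = 0$)

$$
dF(y; \boldsymbol{\theta}) = \exp\left\{ \boldsymbol{\theta}^\top T(y) - \log\left( 1 + \sum_{i=1}^{k} e^{\theta_i} \right) \right\} d\nu(y)
= \prod_{j=1}^{k+1} \left( \frac{e^{\theta_j}}{\sum_{i=1}^{k+1} e^{\theta_i}} \right)^{\mathbb{1}_{\{y=j\}}} d\nu(y).
$$

This is a vector-valued parameter EF with $k \in \mathbb{N}$. The canonical link is slightly
more complicated. Set vectors $\boldsymbol{v} = \exp\{\boldsymbol{\theta}\} \in \mathbb{R}^k$ and $\boldsymbol{w} = (1, \ldots, 1)^\top \in \mathbb{R}^k$. This
provides $\boldsymbol{p} = \nabla_{\boldsymbol{\theta}} \kappa(\boldsymbol{\theta}) = \frac{1}{1 + \boldsymbol{w}^\top \boldsymbol{v}} \boldsymbol{v} \in \mathbb{R}^k$. Set matrix $A_{\boldsymbol{p}} = \mathbb{1} - \boldsymbol{p}\boldsymbol{w}^\top \in \mathbb{R}^{k \times k}$; the
latter gives us $\boldsymbol{p} = A_{\boldsymbol{p}} \boldsymbol{v}$, and since $A_{\boldsymbol{p}}$ has full rank $k$, we obtain canonical link

$$
\boldsymbol{p} \mapsto \boldsymbol{\theta} = h(\boldsymbol{p}) = \log\left( A_{\boldsymbol{p}}^{-1} \boldsymbol{p} \right) = \log\left( \frac{\boldsymbol{p}}{1 - \boldsymbol{w}^\top \boldsymbol{p}} \right).
$$

The last identity can be verified by explicit calculation

$$
\log\left( \frac{\boldsymbol{p}}{1 - \boldsymbol{w}^\top \boldsymbol{p}} \right)
= \log\left( \frac{e^{\boldsymbol{\theta}} / \big( 1 + \sum_{j=1}^{k} e^{\theta_j} \big)}{1 - \sum_{i=1}^{k} e^{\theta_i} / \big( 1 + \sum_{j=1}^{k} e^{\theta_j} \big)} \right)
= \log\left( e^{\boldsymbol{\theta}} \right) = \boldsymbol{\theta}.
$$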
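
The mean function and the canonical link of the categorical model can be sketched directly from the two formulas above. This is a minimal illustration using plain Python lists for the vectors; the function names and the parameter values are assumptions made for the example.

```python
import math


def mean_from_canonical(theta):
    """p = e^theta / (1 + sum_i e^theta_i) for the k free categories."""
    v = [math.exp(t) for t in theta]
    denom = 1.0 + sum(v)
    return [vi / denom for vi in v]


def canonical_link(p):
    """theta = log(p / (1 - w^T p)), with w the vector of ones."""
    rest = 1.0 - sum(p)          # probability of the reference category k + 1
    if rest <= 0.0:
        raise ValueError("the components of p must sum to less than 1")
    return [math.log(pi / rest) for pi in p]


theta = [0.3, -1.2, 2.0]         # illustrative canonical parameter, k = 3
p = mean_from_canonical(theta)
theta_back = canonical_link(p)
```

The round trip recovers the canonical parameter, and the remaining probability mass $1 - \boldsymbol{w}^\top \boldsymbol{p}$ belongs to the reference category $k + 1$, whose canonical parameter is set to zero.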


Remarks 2.11


• There are many more examples that belong to the EF. From Theorem 2.4, we
know that all examples of the EF are light-tailed in the sense that all moments of
$T(Y)$ exist. If we want to model heavy-tailed distributions within the EF, we first
need to apply a suitable transformation. We could model the Pareto distribution
using transformation $T(y) = \log y$, and assuming that the transformed random
variable has an exponential distribution. Different light-tailed examples are
obtained by, e.g., using transformation $T(y) = y^{\tau}$ for the Weibull distribution
or $T(y) = (\log y, \log(1 - y))^\top$ for the beta distribution. We refrain from giving
explicit formulas for these or other examples.

• Observe that in all examples above we have $\mathfrak{T} \subseteq \overline{\mathcal{M}}$, i.e., the support of $T(Y)$
is contained in the closure of the dual parameter space $\mathcal{M}$; we come back to this
observation in Sect. 2.2.4, below.


2.2 Exponential Dispersion Family 


In the previous section we have introduced the EF, and we have explicitly studied the 
vector-valued parameter EF examples of the Gaussian, the gamma and the inverse 
Gaussian models. We have highlighted that these three vector-valued parameter 
EFs can be turned into single-parameter EFs by declaring one parameter to be 
a nuisance parameter that is not modeled (and acts as a hyper-parameter). This 
changes these three models into single-parameter EFs. These three single-parameter 
EFs with nuisance parameter can also be interpreted as EDF models. In this section 
we discuss the single-parameter EDF; this is sufficient for our purposes, and vector- 
valued parameter extensions can be obtained in a canonical way. 


2.2.1 Definition and Properties 


The EFs of Sect. 2.1 can be extended to EDFs. In the single-parameter case this
is achieved by a transformation $Y = X/\omega$, where $\omega > 0$ is a scaling and where $X$
belongs to a single-parameter linear EF, i.e., with $T(x) = x$. We restrict ourselves to
the single-parameter case $k = 1$ throughout this section. Choose a $\sigma$-finite measure
$\nu_1$ on $\mathbb{R}$ and a measurable function $a_1 : \mathbb{R} \to \mathbb{R}$. These choices give a single-parameter
linear EF, directly modeling a real-valued random variable $T(X) = X$.
By (2.2) we have distribution for the single-parameter linear EF random variable $X$

$$
dF(x; \theta, 1) = f(x; \theta, 1)\, d\nu_1(x) = \exp\left\{ \theta x - \kappa(\theta) + a_1(x) \right\} d\nu_1(x),
$$

on the effective domain

$$
\Theta = \left\{ \theta \in \mathbb{R};\ \int_{\mathbb{R}} \exp\left\{ \theta x + a_1(x) \right\} d\nu_1(x) < \infty \right\}, \tag{2.11}
$$

and with cumulant function

$$
\theta \in \Theta \ \mapsto\ \kappa(\theta) = \log\left( \int_{\mathbb{R}} \exp\left\{ \theta x + a_1(x) \right\} d\nu_1(x) \right). \tag{2.12}
$$

Throughout, we assume that the effective domain $\Theta$ has a non-empty interior $\mathring{\Theta}$.
Thus, since $\Theta$ is convex, we assume that $\mathring{\Theta}$ is a non-empty (possibly infinite) open
interval in $\mathbb{R}$.

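Following Jørgensen [201, 202], we extend this linear EF to an EDF as follows.
Choose a family of $\sigma$-finite measures $\nu_{\omega}$ on $\mathbb{R}$ and measurable functions $a_{\omega} : \mathbb{R} \to \mathbb{R}$
for a given index set $\mathcal{W} \ni \omega$ with $\{1\} \subset \mathcal{W} \subset \mathbb{R}_+$. Assume that we have an
$\omega$-independent scaled cumulant function $\kappa$ on this index set $\mathcal{W}$, that is,

$$
\theta \in \Theta \ \mapsto\ \kappa(\theta) = \frac{1}{\omega} \left( \log \int_{\mathbb{R}} \exp\left\{ \theta x + a_{\omega}(x) \right\} d\nu_{\omega}(x) \right) \qquad \text{for all } \omega \in \mathcal{W},
$$

with effective domain $\Theta$ defined by (2.11), i.e., for $\omega = 1$. This allows us to consider
the distribution functions

$$
\begin{aligned}
dF(x; \theta, \omega) = f(x; \theta, \omega)\, d\nu_{\omega}(x) &= \exp\left\{ \theta x - \omega\kappa(\theta) + a_{\omega}(x) \right\} d\nu_{\omega}(x) \\
&= \exp\left\{ \omega\left( \theta y - \kappa(\theta) \right) + a_{\omega}(\omega y) \right\} d\nu_{\omega}(\omega y), \tag{2.13}
\end{aligned}
$$

in the third identity we did a change of variable $x \mapsto y = x/\omega$. By re-parametrizing
the function $a_{\omega}(\omega\, \cdot)$ and the $\sigma$-finite measures $\nu_{\omega}(\omega\, \cdot)$ slightly
differently, depending on the particular structure of the chosen $\sigma$-finite measures,
we arrive at the following single-parameter EDF.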


Definition 2.12 The (single-parameter) EDF is given by densities of the form

$$
Y \sim f(y; \theta, v/\varphi) = \exp\left\{ \frac{y\theta - \kappa(\theta)}{\varphi/v} + a(y; v/\varphi) \right\}, \tag{2.14}
$$

with

$\kappa : \Theta \to \mathbb{R}$ is the cumulant function (2.12),
$\theta \in \Theta$ is the canonical parameter in the effective domain (2.11),
$v > 0$ is a given weight (exposure, volume),
$\varphi > 0$ is the dispersion parameter,
$a(\cdot\,; \cdot)$ is the normalization, not depending on the canonical parameter $\theta$.
Remarks 2.13


• Exposure $v > 0$ and dispersion parameter $\varphi > 0$ provide the parametrization
usually used for $\omega = v/\varphi \in \mathcal{W}$. Their meaning and interpretation will become
clear below, and they will always appear as a ratio $\omega = v/\varphi$.

• The support of these EDF distributions does not depend on the explicit choice of
the canonical parameter $\theta \in \Theta$, but it may depend on $\omega = v/\varphi \in \mathcal{W}$ through
the choices of the $\sigma$-finite measures $\nu_{\omega}$, for $\omega \in \mathcal{W}$. Consequently, $a(y; \omega)$ is
a normalization such that $f(y; \theta, \omega)$ integrates to 1 w.r.t. the chosen $\sigma$-finite
measure $\nu_{\omega}$ to receive a proper distributional model.

• The transformation $x \mapsto y = x/\omega$ in (2.13) is called duality transformation, see
Section 3.1 in Jørgensen [203]. It provides the duality between the additive form
(in variable $x$ in (2.13)) and the reproductive form (in variable $y$ in (2.13)) of the
EDF; Definition 2.12 is the reproductive form.

• Lemma 2.1 tells us that $\Theta$ is convex, thus, it is a possibly infinite interval in $\mathbb{R}$.
To exclude trivial cases we will always assume that the $\sigma$-finite measure $\nu_1$ is not
concentrated in one single point (this relates to the minimal representation for
$k = 1$ in the linear EF case, see Assumption 2.6), and that the interior $\mathring{\Theta}$ of the
effective domain $\Theta$ is non-empty.


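Corollary 2.14 Assume $\mathring{\Theta}$ is non-empty and that $\nu_1$ is not concentrated in
one single point. Choose $Y \sim F(\cdot\,; \theta, v/\varphi)$ for fixed $\theta \in \mathring{\Theta}$. The moment
generating function of $Y$ for small $r \in \mathbb{R}$ satisfies

$$
M_Y(r) = \mathbb{E}_{\theta}\left[ \exp\{rY\} \right] = \exp\left\{ \frac{v}{\varphi} \left[ \kappa(\theta + r\varphi/v) - \kappa(\theta) \right] \right\}.
$$

The first two moments of $Y$ are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \frac{\varphi}{v} \kappa''(\theta) > 0.
$$

The cumulant function $\kappa$ is smooth and strictly convex on $\mathring{\Theta}$ with canonical
link $h = (\kappa')^{-1}$. The variance function is defined by $\mu \mapsto V(\mu) = (\kappa'' \circ h)(\mu)$
and, consequently, for the variance of $Y$ we have $\mathrm{Var}_{\mu}(Y) = \frac{\varphi}{v} V(\mu)$ for
$\mu \in \mathcal{M}$.


Proof. This follows analogously to Theorem 2.4. The linear case $T(y) = y$ with $\nu_1$
not being concentrated in one single point guarantees that the minimal dimension is
$k = 1$, providing a minimal representation in this dimension, see Assumption 2.6. □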


Before giving explicit examples we state the so-called convolution formula. 
Corollary 2.15 (Convolution Formula) Assume $\mathring{\Theta}$ is non-empty and that $\nu_1$ is not
concentrated in one single point. Assume that $Y_i \sim F(\cdot\,; \theta, v_i/\varphi)$ are independent,
for $1 \le i \le n$, with fixed $\theta \in \mathring{\Theta}$. Set $v_+ = \sum_{i=1}^{n} v_i$. Then

$$
Y_+ = \frac{1}{v_+} \sum_{i=1}^{n} v_i Y_i \ \sim\ F(\cdot\,; \theta, v_+/\varphi).
$$

Proof. The proof immediately follows from calculating the moment generating
function $M_{Y_+}(r)$ and from using the independence between the $Y_i$'s. □
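
The moment generating function argument behind the convolution formula can be checked numerically. The sketch below uses the moment generating function of Corollary 2.14 and, as an illustrative assumption, the Poisson cumulant function $\kappa(\theta) = e^{\theta}$ with $\varphi = 1$; volumes, $\theta$ and $r$ are arbitrary.

```python
import math


def edf_mgf(r, theta, v, phi, kappa):
    """M_Y(r) = exp{(v/phi) [kappa(theta + r phi / v) - kappa(theta)]} (Corollary 2.14)."""
    omega = v / phi
    return math.exp(omega * (kappa(theta + r / omega) - kappa(theta)))


def kappa_poisson(theta):
    """Poisson cumulant function, used here only as a concrete example."""
    return math.exp(theta)


theta, phi = 0.4, 1.0
volumes = [1.0, 2.5, 4.0]
v_plus = sum(volumes)
r = 0.3

# Independence: the MGF of Y_+ = sum_i (v_i / v_+) Y_i is the product of the
# individual MGFs evaluated at r v_i / v_+.
lhs = 1.0
for v in volumes:
    lhs *= edf_mgf(r * v / v_plus, theta, v, phi, kappa_poisson)

# Claimed law of Y_+: the same EDF with aggregated volume v_+.
rhs = edf_mgf(r, theta, v_plus, phi, kappa_poisson)
```

Both sides agree up to floating-point error, because each factor contributes $\frac{v_i}{\varphi}[\kappa(\theta + r\varphi/v_+) - \kappa(\theta)]$ to the exponent.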


2.2.2 Exponential Dispersion Family Examples 


The single-parameter linear EF examples introduced above can be reformulated as 
EDF examples. 


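Binomial Distribution as a Single-Parameter EDF


For the binomial distribution with parameters $p \in (0, 1)$ and $n \in \mathbb{N}$ we choose
the counting measure on $\{0, 1/n, \ldots, 1\}$ with $\omega = n$. Then we make the following
choices

$$
a(y) = \log\binom{n}{ny}, \qquad \kappa(\theta) = \log(1 + e^{\theta}), \qquad p = \kappa'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}, \qquad \theta = h(p) = \log\left( \frac{p}{1 - p} \right),
$$

for effective domain $\Theta = \mathbb{R}$ and dual parameter space $\mathcal{M} = (0, 1)$. With these
choices we have

$$
f(y; \theta, n) = \binom{n}{ny} \exp\left\{ n \left( \theta y - \log(1 + e^{\theta}) \right) \right\} = \binom{n}{ny} \left( \frac{e^{\theta}}{1 + e^{\theta}} \right)^{ny} \left( \frac{1}{1 + e^{\theta}} \right)^{n - ny}.
$$

This is a single-parameter EDF. The canonical link $p \mapsto h(p)$ gives the logit
function. Mean and variance are given by

$$
p = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \frac{1}{n} \kappa''(\theta) = \frac{1}{n} \frac{e^{\theta}}{(1 + e^{\theta})^2} = \frac{1}{n} p(1 - p),
$$

and the variance function is given by $V(\mu) = \mu(1 - \mu)$. The binomial random
variable is obtained by setting $X = nY \sim \mathrm{Binom}(n, p)$.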
Poisson Distribution as a Single-Parameter EDF


For the Poisson distribution with parameters $\lambda > 0$ and $v > 0$ we choose the
counting measure on $\mathbb{N}_0/v$ for exposure $\omega = v$. Then we make the following choices

$$
a(y) = \log\left( \frac{v^{vy}}{(vy)!} \right), \qquad \kappa(\theta) = e^{\theta}, \qquad \lambda = \kappa'(\theta) = e^{\theta}, \qquad \theta = h(\lambda) = \log(\lambda),
$$

for effective domain $\Theta = \mathbb{R}$ and dual parameter space $\mathcal{M} = (0, \infty)$. With these
choices we have

$$
f(y; \theta, v) = \frac{v^{vy}}{(vy)!} \exp\left\{ v \left( \theta y - e^{\theta} \right) \right\} = e^{-v\lambda} \frac{(v\lambda)^{vy}}{(vy)!}. \tag{2.15}
$$

This is a single-parameter EDF. The canonical link $\lambda \mapsto h(\lambda)$ is the log-link. Mean
and variance are given by

$$
\lambda = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = e^{\theta} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \frac{1}{v} \kappa''(\theta) = \frac{1}{v} e^{\theta} = \frac{1}{v} \lambda,
$$

and the variance function is given by $V(\lambda) = \lambda$, that is, the variance function is
linear in the mean parameter $\lambda$. The Poisson random variable is obtained by setting
$X = vY \sim \mathrm{Poi}(v\lambda)$. We choose $\varphi = 1$, here, meaning that we have neither under-
nor over-dispersion. Thus, the choices $v$ and $\varphi$ in $\omega = v/\varphi$ have the interpretation
of an exposure and a dispersion parameter, respectively. This interpretation is going
to be important in claim counts modeling, below.


Gamma Distribution as a Single-Parameter EDF 


For the gamma distribution with parameters $\alpha, \beta > 0$ we choose the Lebesgue
measure on $\mathbb{R}_+$ and shape parameter $\omega = v/\varphi = \alpha$. We make the following choices

$$
a(y) = (\alpha - 1) \log y + \alpha \log\alpha - \log\Gamma(\alpha), \qquad \kappa(\theta) = -\log(-\theta),
$$

$$
\mu = \kappa'(\theta) = -1/\theta, \qquad \theta = h(\mu) = -1/\mu,
$$

for effective domain $\Theta = (-\infty, 0)$ and dual parameter space $\mathcal{M} = (0, \infty)$. With
these choices we have

$$
f(y; \theta, \alpha) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp\left\{ \alpha \left( y\theta + \log(-\theta) \right) \right\} = \frac{(-\theta\alpha)^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp\left\{ -(-\theta\alpha) y \right\}.
$$

This is analogous to (2.6) with shape parameter $\alpha > 0$ and scale parameter
$\beta = -\theta > 0$. Mean and variance are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = -\theta^{-1} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \frac{1}{\alpha} \kappa''(\theta) = \frac{1}{\alpha} \theta^{-2},
$$

and the variance function is given by $V(\mu) = \mu^2$, that is, the variance function
is quadratic in the mean parameter $\mu$. The gamma random variable is obtained by
setting $X = \alpha Y \sim \Gamma(\alpha, \beta)$. This gives us for the first two moments of $X$

$$
\mu_X = \mathbb{E}_{\theta}[X] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(X) = \frac{\alpha}{\beta^2} = \frac{1}{\alpha} \mu_X^2.
$$

Suppose $v = 1$. For shape parameter $\alpha > 1$ we have under-dispersion $\varphi = 1/\alpha < 1$
and the gamma density is unimodal; for shape parameter $\alpha < 1$ we have over-dispersion
$\varphi = 1/\alpha > 1$ and the gamma density is strictly decreasing; we refer to
Fig. 2.1.


Inverse Gaussian Distribution as a Single-Parameter EDF 


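For the inverse Gaussian distribution with parameters $\alpha, \beta > 0$ we choose the
Lebesgue measure on $\mathbb{R}_+$ and we set $\omega = v/\varphi = \alpha$. We make the following choices

$$
a(y) = \log\left( \frac{\alpha^{1/2}}{(2\pi y^3)^{1/2}} \right) - \frac{\alpha}{2y}, \qquad \kappa(\theta) = -(-2\theta)^{1/2},
$$

$$
\mu = \kappa'(\theta) = \frac{1}{(-2\theta)^{1/2}}, \qquad \theta = h(\mu) = -\frac{1}{2\mu^2},
$$

for $\theta \in (-\infty, 0)$ and dual parameter space $\mathcal{M} = (0, \infty)$. With these choices we
have

$$
\begin{aligned}
f(y; \theta, \alpha)\, dy &= \frac{\alpha^{1/2}}{(2\pi y^3)^{1/2}} \exp\left\{ \alpha \left( \theta y + (-2\theta)^{1/2} \right) - \frac{\alpha}{2y} \right\} dy \\
&= \frac{\alpha^{1/2}}{(2\pi y^3)^{1/2}} \exp\left\{ -\frac{\alpha}{2y} \left( 1 - (-2\theta)^{1/2} y \right)^2 \right\} dy \\
&= \frac{\alpha}{(2\pi x^3)^{1/2}} \exp\left\{ -\frac{\alpha^2}{2x} \left( 1 - \frac{(-2\theta)^{1/2}}{\alpha} x \right)^2 \right\} dx,
\end{aligned}
$$

where in the last step we did a change of variable $y \mapsto x = \alpha y$. This is exactly (2.8).
Mean and variance are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa'(\theta) = (-2\theta)^{-1/2} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(Y) = \frac{1}{\alpha} \kappa''(\theta) = \frac{1}{\alpha} (-2\theta)^{-3/2},
$$

and the variance function is given by $V(\mu) = \mu^3$, that is, the variance function is
cubic in the mean parameter $\mu$. The inverse Gaussian random variable is obtained by
setting $X = \alpha Y$. The mean and variance of $X$ are given by, set $\beta = (-2\theta)^{1/2} > 0$,

$$
\mu_X = \mathbb{E}_{\theta}[X] = \frac{\alpha}{\beta} \qquad \text{and} \qquad \mathrm{Var}_{\theta}(X) = \frac{\alpha}{\beta^3} = \frac{1}{\alpha^2} \mu_X^3.
$$

This inverse Gaussian density is illustrated in Fig. 2.2.

Similarly to (2.9), we can extend the inverse Gaussian model to the boundary
case $\theta = 0$, i.e., the effective domain $\Theta = (-\infty, 0]$ is not open. This provides us
with density

$$
f(y; \theta = 0, \alpha)\, dy = \frac{\alpha}{(2\pi x^3)^{1/2}} \exp\left\{ -\frac{\alpha^2}{2x} \right\} dx, \tag{2.16}
$$

using, as above, the change of variable $y \mapsto x = \alpha y$. An additional transformation
$x \mapsto 1/x$ gives a gamma distribution with shape parameter $1/2$ and scale parameter
$\alpha^2/2$.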


Remark 2.16 The inverse Gaussian case gives an example of a non-open effective
domain $\Theta = (-\infty, 0]$. It is worth noting that for the boundary parameter $\theta = 0$,
the first moment does not exist, i.e., Corollary 2.14 only makes statements in the
interior $\mathring{\Theta}$ of the effective domain $\Theta$. This also relates to Remarks 2.9 on the dual
parameter space $\mathcal{M}$.


2.2.3 Tweedie’s Distributions 


Tweedie's compound Poisson (CP) model was introduced in 1984 by Tweedie [358],
and it has been studied in detail in Jørgensen [202], Jørgensen–de Souza [204],
Smyth–Jørgensen [342] and in the review paper of Delong et al. [94]. Tweedie's CP
model belongs to the EDF. We spend more time on explaining Tweedie's CP model
because it plays an important role in actuarial modeling.

Tweedie's CP model is obtained by choosing as $\sigma$-finite measure $\nu_1$ a mixture of
the Lebesgue measure on $(0, \infty)$ and a point measure in 0. Furthermore, we choose
power variance parameter $p \in (1, 2)$ and cumulant function

$$
\kappa(\theta) = \kappa_p(\theta) = \frac{1}{2 - p} \left( (1 - p)\theta \right)^{\frac{2-p}{1-p}}, \tag{2.17}
$$

on the effective domain $\theta \in \Theta = (-\infty, 0)$. This provides us with Tweedie's CP
model

$$
Y \sim f(y; \theta, v/\varphi) = \exp\left\{ \frac{y\theta - \kappa_p(\theta)}{\varphi/v} + a(y; v/\varphi) \right\},
$$

with exposure $v > 0$ and dispersion parameter $\varphi > 0$; the normalizing function
$a(\cdot\,; v/\varphi)$ does not have any simple closed form, we refer to Section 2.1 in
Jørgensen–de Souza [204] and Section 4.2 in Jørgensen [203].

The first two moments of Tweedie's CP random variable $Y$ are given by

$$
\mu = \mathbb{E}_{\theta}[Y] = \kappa_p'(\theta) = \left( (1 - p)\theta \right)^{\frac{1}{1-p}} \in \mathcal{M} = (0, \infty), \tag{2.18}
$$

$$
\mathrm{Var}_{\theta}(Y) = \frac{\varphi}{v} \kappa_p''(\theta) = \frac{\varphi}{v} \left( (1 - p)\theta \right)^{\frac{p}{1-p}} = \frac{\varphi}{v} \mu^p > 0. \tag{2.19}
$$

The parameter $p \in (1, 2)$ determines the power variance functions $V(\mu) = \mu^p$
between the Poisson $p = 1$ and the gamma $p = 2$ cases, see Sect. 2.2.2.
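
The moment formulas (2.18) and (2.19) can be checked against numerical derivatives of the cumulant function (2.17). The following is a minimal sketch under the assumptions $p \in (1, 2)$ and $\theta < 0$; the function names, parameter values and the finite-difference step are illustrative.

```python
def kappa_p(theta, p):
    """Tweedie CP cumulant function (2.17) for p in (1, 2) and theta < 0."""
    return ((1.0 - p) * theta) ** ((2.0 - p) / (1.0 - p)) / (2.0 - p)


def tweedie_mean(theta, p):
    """Mean (2.18): mu = ((1 - p) theta)^(1 / (1 - p))."""
    return ((1.0 - p) * theta) ** (1.0 / (1.0 - p))


def tweedie_variance(theta, p, v=1.0, phi=1.0):
    """Variance (2.19): (phi / v) mu^p."""
    return phi / v * tweedie_mean(theta, p) ** p


def num_derivative(f, x, eps=1e-5):
    """Central finite difference, used only as a numerical sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


p, theta = 1.5, -0.8
mu = tweedie_mean(theta, p)
kappa_prime = num_derivative(lambda t: kappa_p(t, p), theta)        # approximates kappa_p'
kappa_second = num_derivative(lambda t: tweedie_mean(t, p), theta)  # approximates kappa_p''
variance = tweedie_variance(theta, p, v=2.0, phi=0.5)
```

For these illustrative values the mean is $\mu = 6.25$ and the variance function evaluates to $V(\mu) = \mu^{1.5} = 15.625$, in agreement with the numerical derivatives.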


The moment generating function of Tweedie's CP random variable $X = vY/\varphi = \omega Y$
in its additive form is given by, we use Corollary 2.14,

$$
M_X(r) = M_{vY/\varphi}(r) = \exp\left\{ \frac{v}{\varphi} \kappa_p(\theta) \left( \left( \frac{-\theta}{-\theta - r} \right)^{\frac{2-p}{p-1}} - 1 \right) \right\} \qquad \text{for } r < -\theta.
$$

Some readers will notice that this is the moment generating function of a CP
distribution having i.i.d. gamma claim sizes. This is exactly the statement of the
next proposition which is found, e.g., in Smyth–Jørgensen [342].


Proposition 2.17 Assume $S = \sum_{i=1}^{N} Z_i$ is CP distributed with Poisson claim
counts $N \sim \mathrm{Poi}(\lambda v)$ and i.i.d. gamma claim sizes $Z_i \sim \Gamma(\alpha, \beta)$ being independent
of $N$. We have $S \overset{(d)}{=} vY/\varphi$ by identifying the parameters as follows

$$
p = \frac{\alpha + 2}{\alpha + 1} \in (1, 2), \qquad \beta = -\theta > 0 \qquad \text{and} \qquad \lambda = \frac{1}{\varphi} \kappa_p(\theta).
$$

Proof of Proposition 2.17. Assume $S$ is CP distributed with i.i.d. gamma claim
sizes. From Proposition 2.11 and Section 3.2.1 in Wüthrich [387] we obtain that
the moment generating function of $S$ is given by

$$
M_S(r) = \exp\left\{ \lambda v \left( \left( \frac{\beta}{\beta - r} \right)^{\alpha} - 1 \right) \right\} \qquad \text{for } r < \beta.
$$

Using the proposed parameter identification, the claim immediately follows. □
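
The parameter identification of Proposition 2.17 can be illustrated numerically by comparing the two moment generating functions. The sketch below evaluates the additive-form moment generating function via Corollary 2.14, $\exp\{\frac{v}{\varphi}[\kappa_p(\theta + r) - \kappa_p(\theta)]\}$, and the compound Poisson one; all names and parameter values are illustrative assumptions.

```python
import math


def kappa_p(theta, p):
    """Tweedie CP cumulant function (2.17) for p in (1, 2) and theta < 0."""
    return ((1.0 - p) * theta) ** ((2.0 - p) / (1.0 - p)) / (2.0 - p)


def cp_parameters(theta, p, phi):
    """Identification of Proposition 2.17: returns (lambda, alpha, beta)."""
    alpha = (2.0 - p) / (p - 1.0)        # equivalent to p = (alpha + 2) / (alpha + 1)
    beta = -theta
    lam = kappa_p(theta, p) / phi
    return lam, alpha, beta


def mgf_compound_poisson(r, lam, v, alpha, beta):
    """MGF of S with N ~ Poi(lam v) and gamma claim sizes, valid for r < beta."""
    return math.exp(lam * v * ((beta / (beta - r)) ** alpha - 1.0))


def mgf_tweedie_additive(r, theta, p, phi, v):
    """MGF of X = v Y / phi from Corollary 2.14, valid for r < -theta."""
    return math.exp(v / phi * (kappa_p(theta + r, p) - kappa_p(theta, p)))


theta, p, phi, v = -0.8, 1.5, 2.0, 3.0
lam, alpha, beta = cp_parameters(theta, p, phi)
r = 0.3                                   # must satisfy r < beta = -theta
mgf_cp = mgf_compound_poisson(r, lam, v, alpha, beta)
mgf_edf = mgf_tweedie_additive(r, theta, p, phi, v)

# The means also agree: E[S] = lam * v * alpha / beta and E[v Y / phi] = v * mu / phi.
mean_cp = lam * v * alpha / beta
mean_edf = v / phi * ((1.0 - p) * theta) ** (1.0 / (1.0 - p))
```

For these values the identification gives $\lambda = 2.5$, $\alpha = 1$ and $\beta = 0.8$, and both moment generating functions evaluate to $e^{4.5}$ at $r = 0.3$.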


Proposition 2.17 gives us a second interpretation of Tweedie's CP model which
was introduced in an EDF fashion, above. This second interpretation explains the
name of this EDF model, it explains the mixture of the Lebesgue measure and the
point measure in 0, and it also highlights why the Poisson model and the gamma
model are the boundary cases in terms of power variance functions.


An interesting question is whether the EDF can be extended beyond power
variance functions $V(\mu) = \mu^p$ with $p \in [1, 2]$. The answer to this question is
yes, and the full answer is provided in Theorem 2 of Jørgensen [202]:


Theorem 2.18 (Jørgensen [202], Without Proof) Only power variance parameters
$p \in (0, 1)$ do not allow for EDF models.


Table 2.1 gives the EDF distributions that have a power variance function. These
distributions are called Tweedie's distributions, with the special case of Tweedie's
CP distributions for $p \in (1, 2)$. The densities for $p \in \{0, 1, 2, 3\}$ have a closed form,
but the other Tweedie's distributions do not have a closed-form density. Thus, they
cannot explicitly be constructed as suggested in Sect. 2.2.1. Besides the constructive
approach presented above, there is a uniqueness theorem saying that the variance
function $V(\cdot)$ on the domain $\mathcal{M}$ characterizes the single-parameter linear EF, see
Theorem 2.11 in Jørgensen [203]. This uniqueness theorem is the basis of the proof
of Theorem 2.18. Tweedie's distributions for $p \notin [0, 1] \cup \{2, 3\}$ involve infinite sums
for the normalization $\exp\{a(\cdot, \cdot)\}$, we refer to formulas (4.19), (4.20) and (4.31) in
Jørgensen [203]; this is the reason that one has to go via the uniqueness theorem
to prove Theorem 2.18. Dunn–Smyth [112] provide methods of fast calculation
of some of these infinite sums; in Sect. 5.5.2, below, we present an approximation
(saddlepoint approximation). The uniqueness theorem is also useful to construct
new examples within the EF, see, e.g., Section 2 of Awad et al. [15].


Table 2.1 Power variance function models $V(\mu) = \mu^p$ within the EDF (taken from Table 4.1 in
Jørgensen [203])

p          | Distribution                                | Support of Y | Θ        | M
p < 0      | Generated by extreme stable distributions   | ℝ            | [0, ∞)   | (0, ∞)
p = 0      | Gaussian distribution                       | ℝ            | ℝ        | ℝ
p = 1      | Poisson distribution                        | ℕ₀           | ℝ        | (0, ∞)
1 < p < 2  | Tweedie's CP distribution                   | [0, ∞)       | (−∞, 0)  | (0, ∞)
p = 2      | Gamma distribution                          | (0, ∞)       | (−∞, 0)  | (0, ∞)
p > 2      | Generated by positive stable distributions  | (0, ∞)       | (−∞, 0]  | (0, ∞)
p = 3      | Inverse Gaussian distribution               | (0, ∞)       | (−∞, 0]  | (0, ∞)
2.2.4 Steepness of the Cumulant Function


Assume we have a fixed EF satisfying Assumption 2.6. All random variables $T(Y)$
belonging to this EF have the same support, not depending on the particular choice
of the canonical parameter $\theta \in \Theta$. We denote this support of $T(Y)$ by $\mathfrak{T}$.

Below, we are going to estimate the canonical parameter $\theta \in \Theta$ from data using
maximum likelihood estimation. For this it is advantageous to have the property
$\mathfrak{T} \subset \mathcal{M}$, because, intuitively, this allows us to directly select $\hat{\mu} = T(Y)$ as the
parameter estimate in the dual parameter space $\mathcal{M}$, for a given observation $T(Y) \in \mathfrak{T}$.
This then translates to a canonical parameter $\hat{\theta} = h(\hat{\mu}) = h(T(Y)) \in \Theta$, using
the canonical link $h$; this estimation approach will be better motivated in Chap. 3,
below. Unfortunately, many examples of the EF do not satisfy this property $\mathfrak{T} \subset \mathcal{M}$.
For instance, in the Poisson model the observation $T(Y) = Y = 0$ is not included
in $\mathcal{M}$, see Table 2.1. This poses some challenges in parameter estimation, and the
purpose of this small discussion is to be prepared for these challenges.

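A cumulant function $\kappa$ is called steep if for all $\theta \in \mathring{\Theta}$ and all $\tilde{\theta}$ in the boundary
of $\Theta$

$$(\tilde{\theta} - \theta)^\top \nabla_\theta \kappa\big(\alpha\theta + (1-\alpha)\tilde{\theta}\big) \to \infty \qquad \text{for } \alpha \downarrow 0, \qquad (2.20)$$

we refer to Formula (20) in Section 8.1 of Barndorff-Nielsen [23]. Define the convex
closure of the support $\mathfrak{T}$ by $\mathfrak{C} = \overline{\mathrm{conv}}(\mathfrak{T})$.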


Theorem 2.19 (Theorem 9.2 in Barndorff-Nielsen [23], Without Proof) Assume
we have a fixed EF satisfying Assumption 2.6. The cumulant function $\kappa$ is steep if
and only if $\mathring{\mathfrak{C}} = \mathcal{M} = \nabla_\theta \kappa(\mathring{\Theta})$.


Theorem 2.19 tells us that for a steep cumulant function we have
$\mathring{\mathfrak{C}} = \mathcal{M} = \nabla_\theta \kappa(\mathring{\Theta})$. In this case parameter estimation can be extended to
observations $T(Y) \notin \mathcal{M}$ such that we may obtain a degenerate model at the boundary
of $\mathcal{M}$. Coming back to our Poisson example from above, in this case we set $\hat{\mu} = 0$,
which gives a degenerate Poisson model.

Throughout this book we will work under the assumption that $\kappa$ is steep.
The classical examples satisfy this assumption: the examples with power variance
parameter $p$ in $\{0\} \cup [1, \infty)$ satisfy Theorem 2.19; this includes the Gaussian, the
Poisson, the gamma, the inverse Gaussian and Tweedie's CP models, see Table 2.1.
Moreover, the examples we have met in Sect. 2.1 fulfill this assumption; these
are the single-parameter linear EF models of the Bernoulli, the binomial and the 
negative binomial distributions, as well as the vector-valued parameter examples of 
the Gaussian, the gamma and the inverse Gaussian models and of the categorical 
distribution. The only models we have seen that do not have a steep cumulant 
function are the power variance models with p < 0, see Table 2.1. 


Remark 2.20 Working within the EDF needs some additional thoughts because the
support $\mathfrak{T} = \mathfrak{T}_\omega$ of the single-parameter linear EDF random variable $Y = T(Y)$ may
depend on the specific choice of the dispersion parameter $\omega \in \mathcal{W} \supset \{1\}$ through the
σ-finite measure $d\nu_\omega(\omega\,\cdot)$, see (2.13). For instance, in the binomial case the support
of $Y$ is given by $\mathfrak{T}_\omega = \{0, 1/n, \ldots, 1\}$ with $\omega = n$, see Sect. 2.2.2.

Assume that the cumulant function $\kappa$ is steep for the single-parameter linear
EF that corresponds to the single-parameter EDF with $\omega = 1$. Theorem 2.19
then implies that for this choice we have $\mathring{\mathfrak{C}}_{\omega=1} = \nabla_\theta \kappa(\mathring{\Theta})$ with convex closure
$\mathfrak{C}_{\omega=1} = \overline{\mathrm{conv}}(\mathfrak{T}_{\omega=1})$.

Consider $\omega \in \mathcal{W} \setminus \{1\}$ which corresponds to the choice $\nu_\omega$ of the σ-finite measure
on $\mathbb{R}$. This choice belongs to the cumulant function $\theta \mapsto \omega\kappa(\theta)$ in the additive form
($x$-parametrization in (2.13)). Since steepness (2.20) holds for any $\omega > 0$, we obtain
that the interior of the convex closure of the support of this distribution in the
$x$-parametrization in (2.13) is given by $\nabla_\theta\, \omega\kappa(\mathring{\Theta}) = \omega \nabla_\theta \kappa(\mathring{\Theta})$. The duality
transformation $x \mapsto y = x/\omega$ leads to the change of measure $d\nu_\omega(x) \mapsto d\nu_\omega(\omega y)$ and
to the corresponding change of support, see (2.13). The latter implies that in the
reproductive form ($y$-parametrization) the convex closure of the support does not
depend on the specific choice of $\omega \in \mathcal{W}$. Since the EDF representation given in (2.14)
corresponds to the $y$-parametrization (reproductive form), we can use Theorem 2.19
without limitation also for the single-parameter linear EDF given by (2.14), and
$\mathfrak{C}$ does not depend on $\omega \in \mathcal{W}$.


2.2.5 Lab: Large Claims Modeling 


From Corollary 2.14 we know that the moment generating function exists around the
origin for all examples belonging to the EDF. This implies that the moments of all
orders exist, and that we have an exponentially decaying survival function
$\mathbb{P}_\theta[Y > y] = 1 - F(y; \theta, \omega) \sim \exp\{-\varrho y\}$ for some $\varrho > 0$ as $y \to \infty$, see (1.2). In many
applied situations the data is more heavy-tailed and, thus, cannot be modeled by
such an exponentially decaying survival function. In such cases one often chooses
a distribution function with a regularly varying survival function; regular variation
with tail index $\beta > 0$ has been introduced in (1.3). A popular choice is a log-gamma
distribution which can be obtained from the gamma distribution (belonging to the
EDF). We briefly explain how this is done and how it relates to the Pareto and the
Lomax [256] distributions.

We start from the gamma density (2.6). The random variable $Z$ has a log-gamma
distribution with shape parameter $\alpha > 0$ and scale parameter $\beta = -\theta > 0$ if
$\log(Z) = Y$ has a gamma distribution with these parameters. Thus, the gamma
density of $Y = \log(Z)$ is given by

$$f(y; \beta, \alpha)\,dy = \frac{\beta^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} \exp\{-\beta y\}\,dy \qquad \text{for } y > 0.$$

We do a change of variable $y \mapsto z = \exp\{y\}$ to obtain the density of the log-gamma
distributed random variable $Z = \exp\{Y\}$

$$f(z; \beta, \alpha)\,dz = \frac{\beta^\alpha}{\Gamma(\alpha)}\, (\log z)^{\alpha-1} z^{-(\beta+1)}\,dz \qquad \text{for } z > 1.$$

This log-gamma density has support $(1, \infty)$. The distribution function of this log-
gamma distributed random variable needs to be calculated numerically, and its
survival function is regularly varying with tail index $\beta > 0$.

A special case of the log-gamma distribution is the Pareto distribution. The Pareto
distribution is more tractable and it is obtained by setting shape parameter $\alpha = 1$ in
the log-gamma density. This gives us the Pareto density

$$f(z; \beta)\,dz = f(z; \beta, \alpha = 1)\,dz = \beta z^{-(\beta+1)}\,dz \qquad \text{for } z > 1.$$

The distribution function in this Pareto case is for $z \ge 1$ given by

$$F(z; \beta) = 1 - z^{-\beta}.$$


Obviously, this provides a regularly varying survival function with tail index $\beta > 0$;
in fact, in this case we do not need to go over to the limit in (1.3) because we
have an exact identity. The Pareto distribution has the nice property that it is closed
under thresholding (lower-truncation) with $M$, that is, we remain within the family
of Pareto distributions with the same tail index $\beta$ by considering lower-truncated
claims: for $1 \le M \le z$ we have

$$F(z; \beta, M) = \mathbb{P}[Z \le z \mid Z > M] = \frac{\mathbb{P}[M < Z \le z]}{\mathbb{P}[Z > M]} = 1 - \left(\frac{z}{M}\right)^{-\beta}.$$

This is the classical definition of the Pareto distribution, and it allows us to preserve
full flexibility in the choice of the threshold $M > 0$.
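
The closedness under thresholding can be verified numerically by comparing the conditional distribution function, computed from its definition, with the Pareto distribution function with threshold $M$. This is only an illustrative sketch; the function names are hypothetical and not taken from the text.

```python
def pareto_cdf(z, beta, M=1.0):
    """Pareto distribution function 1 - (z / M)^(-beta) for z >= M, and 0 below M."""
    if z < M:
        return 0.0
    return 1.0 - (z / M) ** (-beta)


def truncated_cdf(z, beta, M):
    """P[Z <= z | Z > M] for Z Pareto with threshold 1, computed from the definition."""
    numerator = pareto_cdf(z, beta) - pareto_cdf(M, beta)  # P[M < Z <= z]
    denominator = 1.0 - pareto_cdf(M, beta)                # P[Z > M]
    return numerator / denominator


beta, M, z = 2.0, 5.0, 12.0
print(truncated_cdf(z, beta, M), pareto_cdf(z, beta, M))
```

The two printed values should agree up to floating point error, for any tail index $\beta > 0$ and any $1 \le M \le z$.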

The disadvantage of the Pareto distribution is that it does not provide a
continuous density on $\mathbb{R}_+$ as there is a discontinuity in threshold $M$. For this reason,
one sometimes explores another change of variable $Z \mapsto X = Z - M$ for a Pareto
distributed random variable $Z \sim F(\cdot; \beta, M)$. This provides the Lomax distribution,
also called Pareto Type II distribution. $X$ has the following distribution function on
$(0, \infty)$

$$F(x; \beta, M) = 1 - \left(\frac{x+M}{M}\right)^{-\beta} \qquad \text{for } x \ge 0.$$

This distribution has again a regularly varying survival function with tail index
$\beta > 0$. Moreover, we have

$$\lim_{x \to \infty} \frac{\left(\frac{x+M}{M}\right)^{-\beta}}{\left(\frac{x}{M}\right)^{-\beta}} = \lim_{x \to \infty} \left(1 + \frac{M}{x}\right)^{-\beta} = 1.$$
Fig. 2.3 Log-log plot of a Pareto and a Lomax distribution with tail index β = 2 and threshold M = 1'000'000 (vertical axis: logged survival function; curves: Pareto distribution, Lomax distribution)

This says that we should choose the same threshold $M > 0$ for both the Pareto and
the Lomax distribution to obtain the same asymptotic tail behavior, and this also
quantifies the rate of convergence between the two survival functions. Figure 2.3
illustrates this convergence in a log-log plot choosing tail index $\beta = 2$ and threshold
$M = 1'000'000$.

For completeness we provide the density of the Pareto distribution

$$f(z; \beta, M) = \frac{\beta}{M} \left(\frac{z}{M}\right)^{-(\beta+1)} \qquad \text{for } z \ge M,$$

and of the Lomax distribution

$$f(x; \beta, M) = \frac{\beta}{M} \left(\frac{x+M}{M}\right)^{-(\beta+1)} \qquad \text{for } x \ge 0.$$
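
To make the rate of convergence between the two tails tangible, the following sketch evaluates the Pareto and Lomax survival functions and their ratio $(1 + M/x)^{-\beta}$ on a grid of large arguments, using the tail index and threshold of Fig. 2.3. The helper names are hypothetical.

```python
def pareto_survival(z, beta, M):
    """Survival function (z / M)^(-beta) of the Pareto distribution for z >= M."""
    return (z / M) ** (-beta) if z >= M else 1.0


def lomax_survival(x, beta, M):
    """Survival function ((x + M) / M)^(-beta) of the Lomax distribution for x >= 0."""
    return ((x + M) / M) ** (-beta)


def survival_ratio(x, beta, M):
    """Closed form (1 + M / x)^(-beta) of the Lomax-to-Pareto survival ratio, x >= M."""
    return (1.0 + M / x) ** (-beta)


beta, M = 2.0, 1_000_000.0
for x in (2e6, 1e7, 1e8, 1e9):
    ratio = lomax_survival(x, beta, M) / pareto_survival(x, beta, M)
    print(f"x = {x:.0e}: ratio of survival functions = {ratio:.6f}")
```

The ratio increases towards 1 as $x$ grows, so in a log-log plot the two curves become indistinguishable far in the tail, while they differ visibly close to the threshold.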


2.3 Information Geometry in Exponential Families 


We do a short excursion to information geometry. This excursion may look a bit 
disconnected from what we have done so far, but it provides us with important 
background information for the chapter on forecast evaluation, see Chap. 4, below. 


2.3.1 Kullback—Leibler Divergence 


There is literature in information geometry which uses techniques from differential
geometry to study EFs as Riemannian manifolds with points corresponding to EF
densities parametrized by their canonical parameters $\theta \in \Theta$; we refer to Amari [10],
Ay et al. [16] and Nielsen [285] for an extended treatment of these mathematical
concepts.

Choose a fixed EF (2.2) with cumulant function $\kappa$ on the effective domain
$\Theta \subseteq \mathbb{R}^k$ and with σ-finite measure $\nu$ on $\mathbb{R}$. We define the Kullback–Leibler (KL)
divergence (relative entropy) from model $\theta_1 \in \Theta$ to model $\theta_0 \in \Theta$ within this EF
by

$$D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big) = \int_{\mathbb{R}} f(y; \theta_0) \log\left(\frac{f(y; \theta_0)}{f(y; \theta_1)}\right) d\nu(y) \ge 0.$$


Recall that the support of the EF does not depend on the specific choice of the
canonical parameter $\theta$ in $\Theta$, see Remarks 2.3; this implies that the KL divergence
is well-defined, here. The positivity of the KL divergence is obtained from Jensen's
inequality; this is proved in Lemma 2.21, below.

The KL divergence has the interpretation of having a data model that is
characterized by the distribution $f(\cdot; \theta_0)$, and we would like to measure how close
another model $f(\cdot; \theta_1)$ is to the data model. Note that the KL divergence is not
a distance function because it is neither symmetric nor does it satisfy the triangle
inequality.

We calculate the KL divergence within the chosen EF

$$D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big) = \int_{\mathbb{R}} f(y; \theta_0) \Big[ (\theta_0 - \theta_1)^\top T(y) - \kappa(\theta_0) + \kappa(\theta_1) \Big] d\nu(y)$$
$$= (\theta_0 - \theta_1)^\top \nabla_\theta \kappa(\theta_0) - \kappa(\theta_0) + \kappa(\theta_1) \ge 0, \qquad (2.21)$$


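where we have used Corollary 2.5, and the positivity of the KL divergence can be
seen from the convexity of $\kappa$. This allows us to consider the following (Taylor)
expansion

$$\kappa(\theta_1) = \kappa(\theta_0) + \nabla_\theta \kappa(\theta_0)^\top (\theta_1 - \theta_0) + D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big). \qquad (2.22)$$

This illustrates that the KL divergence corresponds to second and higher order
differences between the cumulant value $\kappa(\theta_0)$ and another cumulant value $\kappa(\theta_1)$.
The gradients of the KL divergence w.r.t. $\theta_1$ in $\theta_1 = \theta_0$ and w.r.t. $\theta_0$ in $\theta_0 = \theta_1$ are
given by

$$\nabla_{\theta_1} D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big)\Big|_{\theta_1 = \theta_0} = \nabla_{\theta_0} D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big)\Big|_{\theta_0 = \theta_1} = 0. \qquad (2.23)$$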


This emphasizes that the KL divergence reflects second and higher-order terms in
cumulant function $\kappa$, and that the data model $\theta_0$ forms the minimum of this KL
divergence (as a function of $\theta_1$) as we will just see. We calculate the Hessian (second
order term) w.r.t. $\theta_1$ in $\theta_1 = \theta_0$

$$\nabla^2_{\theta_1} D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big)\Big|_{\theta_1 = \theta_0} = \nabla^2_\theta \kappa(\theta)\Big|_{\theta = \theta_0} \overset{\text{def.}}{=} \mathcal{I}(\theta_0).$$

The positive definite matrix $\mathcal{I}(\theta_0)$ (in a minimal representation) is called Fisher's
information. Fisher's information is an important tool in statistics that we will
meet in Theorem 3.13 of Sect. 3.3, below. A function satisfying (2.21) (being
zero if and only if $\theta_0 = \theta_1$), fulfilling (2.23) and having positive definite
Fisher's information is called divergence, see Definition 5 in Nielsen [285]. Fisher's
information $\mathcal{I}(\theta_0)$ measures the curvature of the KL divergence in $\theta_0$ and we have
the second order Taylor approximation

$$\kappa(\theta_1) \approx \kappa(\theta_0) + \nabla_\theta \kappa(\theta_0)^\top (\theta_1 - \theta_0) + \frac{1}{2} (\theta_1 - \theta_0)^\top \mathcal{I}(\theta_0)\, (\theta_1 - \theta_0).$$

Next-order terms are obtained from the so-called Amari–Chentsov tensor, see Amari
[10] and Section 4.2 in Ay et al. [16]. In information geometry one studies the
(possibly degenerate) Riemannian metric on the effective domain $\Theta$ induced by
Fisher's information; we refer to Section 3.7 in Nielsen [285].


Lemma 2.21 Consider two densities $p$ and $q$ w.r.t. a given σ-finite measure $\nu$. We
have $D_{\mathrm{KL}}(p\,\|\,q) \ge 0$, and $D_{\mathrm{KL}}(p\,\|\,q) = 0$ if and only if $p = q$, $\nu$-a.s.


Proof Assume $Y \sim p\,d\nu$, then we can rewrite the KL divergence, using Jensen's
inequality,

$$D_{\mathrm{KL}}(p\,\|\,q) = \int p(y) \log\left(\frac{p(y)}{q(y)}\right) d\nu(y) = -\mathbb{E}_p\left[\log\left(\frac{q(Y)}{p(Y)}\right)\right]$$
$$\ge -\log \mathbb{E}_p\left[\frac{q(Y)}{p(Y)}\right] = -\log \int q(y)\,d\nu(y) \ge 0. \qquad (2.24)$$

Equality holds if and only if $p = q$, $\nu$-a.s. The last inequality of (2.24) considers
that $q$ does not necessarily need to be a density w.r.t. $\nu$, i.e., we can also have
$\int q(y)\,d\nu(y) < 1$. □


2.3.2 Unit Deviance and Bregman Divergence 


In the next chapter we are going to introduce maximum likelihood estimation for
parameters, see Definition 3.4, below. Maximum likelihood estimators are obtained
by maximizing likelihood functions (evaluated in the observations). Maximizing
likelihood functions within the EDF is equivalent to minimizing deviance loss
functions. Deviance loss functions are based on unit deviances, which, in turn,
correspond to KL divergences. The purpose of this small section is to discuss this
relation. This should be viewed as a preparation for Chap. 4.

Assume we work within a single-parameter linear EDF, i.e., $T(y) = y$. Using
the canonical link $h$ we obtain the canonical parameter $\theta = h(\mu) \in \Theta \subseteq \mathbb{R}$
from the mean parameter $\mu \in \mathcal{M}$. If we replace the (typically unknown) mean
parameter $\mu$ by an observation $Y$, provided that $Y \in \mathcal{M}$, we get the specific model
that is exactly calibrated to this observation. This provides us with the canonical
parameter estimate $\hat{\theta}_Y = h(Y)$ for $\theta$. We can now measure the KL divergence from
any model represented by $\theta$ to the observation calibrated model $\hat{\theta}_Y = h(Y)$. This
KL divergence is given by (we use (2.21) and we set $\omega = v/\varphi = 1$)

$$D_{\mathrm{KL}}\big(f(\cdot; h(Y), 1)\,\|\,f(\cdot; \theta, 1)\big) = \int_{\mathbb{R}} f(y; \hat{\theta}_Y, 1) \log\left(\frac{f(y; \hat{\theta}_Y, 1)}{f(y; \theta, 1)}\right) d\nu(y)$$
$$= (h(Y) - \theta)\, Y - \kappa(h(Y)) + \kappa(\theta) \ge 0.$$

This latter object is the unit deviance (up to factor 2) of the chosen EDF. It plays a
crucial role in predictive modeling.


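We define the unit deviance under the assumption that $\kappa$ is steep as follows:

$$\mathfrak{d}: \mathfrak{C} \times \mathcal{M} \to \mathbb{R}_+, \qquad (2.25)$$
$$(y, \mu) \mapsto \mathfrak{d}(y, \mu) = 2 \Big( y\,h(y) - \kappa(h(y)) - y\,h(\mu) + \kappa(h(\mu)) \Big) \ge 0,$$

where $\mathfrak{C}$ is the convex closure of the support $\mathfrak{T}$ of $Y$ and $\mathcal{M}$ is the dual parameter
space of the chosen EDF. Steepness of $\kappa$ implies $\mathring{\mathfrak{C}} = \mathcal{M}$, see Theorem 2.19.

This unit deviance $\mathfrak{d}$ is obtained from the KL divergence, and it is (twice) the
difference of two log-likelihood functions, one using canonical parameter $h(y)$ and the
other one having any canonical parameter $\theta = h(\mu) \in \Theta$. That is, for $\mu = \kappa'(\theta)$,

$$\mathfrak{d}(y, \mu) = 2\, D_{\mathrm{KL}}\big(f(\cdot; h(y), 1)\,\|\,f(\cdot; \theta, 1)\big) \qquad (2.26)$$
$$= 2\, \frac{\varphi}{v} \Big( \log f(y; h(y), v/\varphi) - \log f(y; \theta, v/\varphi) \Big),$$

for general $\omega = v/\varphi \in \mathcal{W}$. The latter can be rewritten as

$$f(y; \theta, v/\varphi) = f(y; h(y), v/\varphi)\, \exp\left\{ -\frac{1}{2\varphi/v}\, \mathfrak{d}\big(y, \kappa'(\theta)\big) \right\}. \qquad (2.27)$$

This looks like a generalization of the Gaussian distribution, where the square
difference $(y - \mu)^2$ in the exponent is replaced by the unit deviance $\mathfrak{d}(y, \mu)$ with
$\mu = \kappa'(\theta)$. This interpretation gets further support by the following lemma.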
Lemma 2.22 Under Assumption 2.6 and the assumption that the cumulant function
$\kappa$ is steep, the unit deviance $\mathfrak{d}(y, \mu) \ge 0$ of the chosen EDF is zero if and only if
$y = \mu$. Moreover, the unit deviance $\mathfrak{d}(y, \mu)$ is twice continuously differentiable
w.r.t. $(y, \mu)$ in $\mathring{\mathfrak{C}} \times \mathcal{M}$, and

$$\frac{\partial^2 \mathfrak{d}(y, \mu)}{\partial \mu^2}\bigg|_{y=\mu} = \frac{\partial^2 \mathfrak{d}(y, \mu)}{\partial y^2}\bigg|_{y=\mu} = -\frac{\partial^2 \mathfrak{d}(y, \mu)}{\partial \mu\, \partial y}\bigg|_{y=\mu} = 2/V(\mu) > 0.$$


Proof The positivity and the if and only if statement follow from Lemma 2.21 and
the strict convexity of $\kappa$. Continuous differentiability follows from the smoothness
of $\kappa$ in the interior of $\Theta$. Moreover we have

$$\frac{\partial^2 \mathfrak{d}(y, \mu)}{\partial \mu^2}\bigg|_{y=\mu} = 2\, \frac{\partial}{\partial \mu} \Big( -y\,h'(\mu) + \mu\,h'(\mu) \Big)\bigg|_{y=\mu} = 2h'(\mu) = 2/\kappa''(h(\mu)) = 2/V(\mu) > 0,$$

where $V(\mu)$ is the variance function of the chosen EDF introduced in Corollary 2.14.
The remaining second derivatives are obtained by similar (straightforward)
calculations. □


Remarks 2.23 


• Lemma 2.22 shows that the unit deviance definition of $\mathfrak{d}(y, \mu)$ provides a so-
called regular unit deviance according to Definition 1.1 in Jørgensen [203].
Moreover, any model that can be brought into the form (2.27) for a (regular) unit
deviance is called (regular) reproductive dispersion model, see Definition 1.2 of
Jørgensen [203].

• In general the unit deviance $\mathfrak{d}(y, \mu)$ is not symmetric in its two arguments $y$ and
$\mu$; we come back to this in Fig. 11.1, below.


More generally, the KL divergence and the unit deviance can be embedded into
the framework of Bregman loss functions [50]. We restrict to the single-parameter
EDF case. Assume that $\psi: \mathfrak{C} \to \mathbb{R}$ is a strictly convex function. The Bregman
divergence w.r.t. $\psi$ between $y$ and $\mu$ is defined by

$$D_\psi(y, \mu) = \psi(y) - \psi(\mu) - \psi'(\mu)(y - \mu) \ge 0, \qquad (2.28)$$

where $\psi'$ is a (sub-)gradient of $\psi$. The lower bound holds because of convexity of
$\psi$. Consider the specific choice $\psi(\mu) = \mu h(\mu) - \kappa(h(\mu))$ for the chosen EDF.
Similar to Lemma 2.22 we have $\psi''(\mu) = h'(\mu) = 1/V(\mu) > 0$, which says that
this choice is strictly convex. Using this choice for $\psi$ gives us the unit deviance (up to
factor 1/2)

$$D_\psi(y, \mu) = y\,h(y) - \kappa(h(y)) + \kappa(h(\mu)) - h(\mu)\, y = \frac{1}{2}\, \mathfrak{d}(y, \mu). \qquad (2.29)$$

Thus, the unit deviance $\mathfrak{d}$ can be understood as a difference of log-likelihoods
(2.26), as a KL divergence $D_{\mathrm{KL}}$ and as a Bregman divergence $D_\psi$.
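
The Bregman representation (2.29) can be checked numerically. The sketch below implements (2.28) for a generic pair $(\psi, \psi')$ and applies it to the Poisson and gamma cases. As assumptions not stated at this point of the text, it uses the single-parameter EDF cumulant functions $\kappa(\theta) = \exp\{\theta\}$ with $h(\mu) = \log \mu$ (Poisson) and $\kappa(\theta) = -\log(-\theta)$ with $h(\mu) = -1/\mu$ (gamma); all function names are hypothetical.

```python
import math


def bregman(psi, dpsi, y, mu):
    """Bregman divergence psi(y) - psi(mu) - psi'(mu) * (y - mu), see (2.28)."""
    return psi(y) - psi(mu) - dpsi(mu) * (y - mu)


# psi(mu) = mu * h(mu) - kappa(h(mu)) and psi'(mu) = h(mu) for the two models.
def poisson_psi(mu):
    return mu * math.log(mu) - mu          # h(mu) = log(mu), kappa(theta) = exp(theta)


def gamma_psi(mu):
    return -1.0 - math.log(mu)             # h(mu) = -1/mu, kappa(theta) = -log(-theta)


def poisson_unit_deviance(y, mu):
    return 2.0 * (mu - y - y * math.log(mu / y))


def gamma_unit_deviance(y, mu):
    return 2.0 * (y / mu - 1.0 + math.log(mu / y))


y, mu = 3.0, 2.0
print(bregman(poisson_psi, math.log, y, mu), 0.5 * poisson_unit_deviance(y, mu))
print(bregman(gamma_psi, lambda m: -1.0 / m, y, mu), 0.5 * gamma_unit_deviance(y, mu))
```

In each printed pair the Bregman divergence and half the unit deviance should coincide, for any $y, \mu > 0$.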


Example 2.24 (Poisson Model) We start with a single-parameter EF example.
Consider cumulant function $\kappa(\theta) = \exp\{\theta\}$ for canonical parameter $\theta \in \Theta = \mathbb{R}$;
this gives us the Poisson model. For the KL divergence from model $\theta_1$ to model $\theta_0$
we obtain

$$D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big) = \exp\{\theta_1\} - \exp\{\theta_0\} - (\theta_1 - \theta_0) \exp\{\theta_0\} \ge 0,$$

which is zero if and only if $\theta_0 = \theta_1$. Fisher's information is given by

$$\mathcal{I}(\theta) = \kappa''(\theta) = \exp\{\theta\} > 0.$$

If we have observation $Y > 0$ we obtain a model described by canonical parameter
$\hat{\theta}_Y = h(Y) = \log(Y)$. This gives us unit deviance, see (2.26),

$$\mathfrak{d}(Y, \mu) = 2\, D_{\mathrm{KL}}\big(f(\cdot; h(Y), 1)\,\|\,f(\cdot; \theta, 1)\big)$$
$$= 2 \Big( e^\theta - Y - (\theta - \log(Y))\, Y \Big)$$
$$= 2 \left( \mu - Y - Y \log\left(\frac{\mu}{Y}\right) \right) \ge 0,$$

with $\mu = \kappa'(\theta) = \exp\{\theta\}$. This Poisson unit deviance will commonly be used for
model fitting and forecast evaluation, see, e.g., (5.28). ■
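
As a sanity check of the Poisson formulas, the closed-form KL divergence can be compared with a direct summation of the defining series over the Poisson probability weights, and the unit deviance can be compared with twice the KL divergence. This is a minimal sketch with hypothetical function names; the truncation of the series at 200 terms is an assumption that is only adequate for moderate means.

```python
import math


def poisson_kl(theta0, theta1):
    """Closed-form KL divergence from (2.21) with kappa(theta) = exp(theta)."""
    return math.exp(theta1) - math.exp(theta0) - (theta1 - theta0) * math.exp(theta0)


def poisson_kl_by_summation(theta0, theta1, n_terms=200):
    """KL divergence from its definition, summing over y = 0, ..., n_terms - 1."""
    mu0, mu1 = math.exp(theta0), math.exp(theta1)
    total = 0.0
    for y in range(n_terms):
        log_p0 = -mu0 + y * theta0 - math.lgamma(y + 1)  # log Poisson weight under theta0
        log_p1 = -mu1 + y * theta1 - math.lgamma(y + 1)  # log Poisson weight under theta1
        total += math.exp(log_p0) * (log_p0 - log_p1)
    return total


def poisson_unit_deviance(y, mu):
    """Poisson unit deviance for y > 0 and mu > 0."""
    return 2.0 * (mu - y - y * math.log(mu / y))


theta0, theta1 = 0.3, 1.1
print(poisson_kl(theta0, theta1), poisson_kl_by_summation(theta0, theta1))
print(poisson_unit_deviance(3.0, 2.0), 2.0 * poisson_kl(math.log(3.0), math.log(2.0)))
```

For parameters close to each other, the KL divergence is also well approximated by the quadratic form $\frac{1}{2}(\theta_1 - \theta_0)^2 \exp\{\theta_0\}$ built from Fisher's information, in line with the second order Taylor approximation of $\kappa$.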


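Example 2.25 (Gamma Model) The second example considers a vector-valued
parameter EF example. We consider the cumulant function
$\kappa(\theta) = \log \Gamma(\theta_2) - \theta_2 \log(-\theta_1)$ for $\theta = (\theta_1, \theta_2)^\top \in \Theta = (-\infty, 0) \times (0, \infty)$; this gives us the gamma
model, see Sect. 2.1.3. For the KL divergence from model $\theta_1$ to model $\theta_0$ we obtain

$$D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big) = (\theta_{0,2} - \theta_{1,2})\, \frac{\Gamma'(\theta_{0,2})}{\Gamma(\theta_{0,2})} - \log\left(\frac{\Gamma(\theta_{0,2})}{\Gamma(\theta_{1,2})}\right)$$
$$+\ \theta_{1,2} \log\left(\frac{-\theta_{0,1}}{-\theta_{1,1}}\right) + \theta_{0,2} \left(\frac{-\theta_{1,1}}{-\theta_{0,1}} - 1\right) \ge 0.$$

Fisher's information matrix is given by

$$\mathcal{I}(\theta) = \nabla^2_\theta \kappa(\theta) = \begin{pmatrix} \dfrac{\theta_2}{(-\theta_1)^2} & \dfrac{1}{-\theta_1} \\[2ex] \dfrac{1}{-\theta_1} & \dfrac{\Gamma''(\theta_2)\Gamma(\theta_2) - \Gamma'(\theta_2)^2}{\Gamma(\theta_2)^2} \end{pmatrix}.$$

The off-diagonal terms in Fisher's information matrix $\mathcal{I}(\theta)$ are non-zero which
means that the two components of the canonical parameter $\theta$ interact. Choosing
a different parametrization $\mu = \theta_2/(-\theta_1)$ (dual mean parametrization) and $\alpha = \theta_2$
we obtain diagonal Fisher's information in $(\mu, \alpha)$

$$\mathcal{I}(\mu, \alpha) = \begin{pmatrix} \dfrac{\alpha}{\mu^2} & 0 \\[1.5ex] 0 & \dfrac{\Gamma''(\alpha)\Gamma(\alpha) - \Gamma'(\alpha)^2}{\Gamma(\alpha)^2} - \dfrac{1}{\alpha} \end{pmatrix} = \begin{pmatrix} \dfrac{\alpha}{\mu^2} & 0 \\[1.5ex] 0 & \Psi'(\alpha) - \dfrac{1}{\alpha} \end{pmatrix}, \qquad (2.30)$$

where $\Psi$ is the digamma function, see Footnote 2 on page 22. This transformation
is obtained by using the corresponding Jacobian matrix for variable transformation;
more details are provided in (3.16) below. In this new representation, the parameters
$\mu$ and $\alpha$ are orthogonal; the term $\Psi'(\alpha) - \frac{1}{\alpha}$ is further discussed in Remarks 5.26
and Remarks 5.28, below.

Using this second parametrization based on mean $\mu$ and dispersion $1/\alpha$, we
arrive at the EDF representation of the gamma model. This allows us to calculate the
corresponding unit deviance (within the EDF), which in the gamma case is given by

$$\mathfrak{d}(Y, \mu) = 2 \left( \frac{Y}{\mu} - 1 + \log\left(\frac{\mu}{Y}\right) \right) \ge 0.$$

■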
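Example 2.26 (Inverse Gaussian Model) Our final example considers the inverse
Gaussian vector-valued parameter EF case. We consider the cumulant function
$\kappa(\theta) = -2(\theta_1 \theta_2)^{1/2} - \frac{1}{2} \log(-2\theta_2)$ for $\theta = (\theta_1, \theta_2)^\top \in \Theta = (-\infty, 0] \times (-\infty, 0)$,
see Sect. 2.1.3. For the KL divergence from model $\theta_1$ to model $\theta_0$ we obtain

$$D_{\mathrm{KL}}\big(f(\cdot; \theta_0)\,\|\,f(\cdot; \theta_1)\big) = -\theta_{1,1} \sqrt{\frac{\theta_{0,2}}{\theta_{0,1}}} - \theta_{1,2} \sqrt{\frac{\theta_{0,1}}{\theta_{0,2}}} + \frac{\theta_{0,2} - \theta_{1,2}}{-2\theta_{0,2}}$$
$$-\ 2 (\theta_{1,1} \theta_{1,2})^{1/2} + \frac{1}{2} \log\left(\frac{\theta_{0,2}}{\theta_{1,2}}\right) \ge 0.$$

Fisher's information matrix is given by

$$\mathcal{I}(\theta) = \nabla^2_\theta \kappa(\theta) = \begin{pmatrix} \dfrac{(-2\theta_2)^{1/2}}{(-2\theta_1)^{3/2}} & -\dfrac{1}{2(\theta_1 \theta_2)^{1/2}} \\[2ex] -\dfrac{1}{2(\theta_1 \theta_2)^{1/2}} & \dfrac{(-2\theta_1)^{1/2}}{(-2\theta_2)^{3/2}} + \dfrac{2}{(-2\theta_2)^2} \end{pmatrix}.$$

Again the off-diagonal terms in Fisher's information matrix $\mathcal{I}(\theta)$ are non-zero in
the canonical parametrization. We switch to the mean parametrization by setting
$\mu = (-2\theta_2/(-2\theta_1))^{1/2}$ and $\alpha = -2\theta_2$. This provides us with diagonal Fisher's
information

$$\mathcal{I}(\mu, \alpha) = \begin{pmatrix} \dfrac{\alpha}{\mu^3} & 0 \\[1.5ex] 0 & \dfrac{1}{2\alpha^2} \end{pmatrix}. \qquad (2.31)$$

This transformation is again obtained by using the corresponding Jacobian matrix
for variable transformation, see (3.16), below. We compare the lower-right entries
of (2.30) and (2.31). Remark that we have the first order approximation of the digamma
function

$$\Psi(\alpha) \approx \log \alpha - \frac{1}{2\alpha},$$

and taking derivatives says that these entries of Fisher's information are first order
equivalent; this is also used in the saddlepoint approximation in Sect. 5.5.2, below.
Using this second parametrization based on mean $\mu$ and dispersion $1/\alpha$, we arrive
at the EDF representation of the inverse Gaussian model with unit deviance

$$\mathfrak{d}(Y, \mu) = \frac{(Y - \mu)^2}{\mu^2 Y} \ge 0.$$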
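
The first order equivalence of the lower-right entries of (2.30) and (2.31) can be illustrated numerically. The sketch below evaluates the trigamma function $\Psi'$ with a standard recurrence and asymptotic expansion (an implementation choice made here, not taken from the text) and compares $\Psi'(\alpha) - 1/\alpha$ with $1/(2\alpha^2)$; the function names are hypothetical.

```python
import math


def trigamma(x):
    """Trigamma function Psi'(x) for x > 0 via recurrence and asymptotic expansion."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    total = 0.0
    while x < 20.0:          # recurrence Psi'(x) = Psi'(x + 1) + 1 / x^2
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Asymptotic series 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    return total + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))


def gamma_info_alpha(alpha):
    """Lower-right entry of (2.30) for the gamma model."""
    return trigamma(alpha) - 1.0 / alpha


def inverse_gaussian_info_alpha(alpha):
    """Lower-right entry of (2.31) for the inverse Gaussian model."""
    return 1.0 / (2.0 * alpha * alpha)


for alpha in (1.0, 10.0, 100.0, 1000.0):
    ratio = gamma_info_alpha(alpha) / inverse_gaussian_info_alpha(alpha)
    print(f"alpha = {alpha:7.1f}: ratio of the two entries = {ratio:.5f}")
```

The ratio approaches 1 as $\alpha$ grows, i.e., for small dispersion $1/\alpha$, whereas for $\alpha$ close to 1 the two entries differ noticeably.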


More examples will be given in Chap. 4, below. 
Chapter 3
Estimation Theory


This chapter gives an introduction to decision and estimation theory. This intro- 
duction is based on the books of Lehmann [243, 244], the lecture notes of Künsch 
[229] and the book of Van der Vaart [363]. This chapter presents classical statistical 
estimation theory, it embeds estimation into a historical context, and it provides 
important aspects and intuition for modern data science and predictive modeling. 
For further reading we recommend the books of Barndorff-Nielsen [23], Berger 
[31], Bickel-Doksum [33] and Efron—Hastie [117]. 


3.1 Introduction to Decision Theory 


We start from an observation vector $Y = (Y_1, \ldots, Y_n)^\top$ taking values in a
measurable space $\mathcal{Y} \subset \mathbb{R}^n$, where $n \in \mathbb{N}$ denotes the number of components $Y_i$,
$1 \le i \le n$, in $Y$. Assume that this observation vector $Y$ has been generated by a
distribution belonging to the family $\mathcal{P} = \{P(\cdot; \theta);\ \theta \in \Theta\}$ being parametrized by a
parameter set $\Theta$.


Remarks 3.1 There are some subtle points in the notation that we are going to
use. We use $P(\cdot; \theta)$ for the distribution of the observation vector $Y$, and if we
consider a specific component $Y_i$ of $Y$ we will use the notation $Y_i \sim F(\cdot; \theta)$. We
make this distinction as in estimation theory one often considers i.i.d. observations
$Y_i \sim F(\cdot; \theta)$, $1 \le i \le n$, with (in this case) joint product distribution $Y \sim P(\cdot; \theta)$.
This latter distribution is then used for purposes of maximum likelihood estimation,
etc. The family $\mathcal{P}$ is parametrized by $\theta \in \Theta$, and if we want to emphasize that
this parameter is a $k$-dimensional vector we use boldface notation $\boldsymbol{\theta}$. This is similar
to the EFs introduced in Chap. 2, but in this chapter we do not restrict to EFs.
Finally, we assume identifiability, meaning that different parameters $\theta$ give different
distributions $P(\cdot; \theta) \in \mathcal{P}$.

To fix ideas, assume we want to determine $\gamma(\theta)$ of a given functional $\gamma(\cdot)$ on $\Theta$.
Typically, the true value $\theta \in \Theta$ is not known, and we are not able to determine $\gamma(\theta)$
explicitly. Therefore, we try to estimate $\gamma(\theta)$ from data $Y \sim P(\cdot; \theta)$ that belongs to
the same $\theta \in \Theta$. As an example we may think of working in the EDF of Chap. 2,
and we are interested in the mean $\mu = \mathbb{E}_\theta[Y] = \kappa'(\theta)$ of $Y$. Thus, we aim at
determining $\gamma(\theta) = \kappa'(\theta)$. If the true $\theta$ is unknown, and if we have an observation
$Y$ from this model, we can try to estimate $\gamma(\theta) = \kappa'(\theta)$ from $Y$. This motivation
is based on estimation of $\gamma(\theta)$, but the following framework of decision making is
more general; for instance, it may also be used for statistical hypothesis testing.

Denote the action space of possible decisions (actions) by $\mathbb{A}$. In decision theory
we are looking for a decision rule (action rule)

$$A: \mathcal{Y} \to \mathbb{A}, \qquad Y \mapsto A(Y), \qquad (3.1)$$

which should be understood as an educated guess for $\gamma(\theta)$ based on observation $Y$.
A decision rule is evaluated in terms of a (given) loss function

$$L: \Theta \times \mathbb{A} \to \mathbb{R}_+, \qquad (\theta, a) \mapsto L(\theta, a) \ge 0. \qquad (3.2)$$

$L(\theta, a)$ describes the loss of an action $a \in \mathbb{A}$ w.r.t. a true parameter choice $\theta \in \Theta$.
The risk function of decision rule $A$ for data generated by $Y \sim P(\cdot; \theta)$ is defined by

$$\theta \mapsto R(\theta, A) = \mathbb{E}_\theta[L(\theta, A(Y))] = \int_{\mathcal{Y}} L(\theta, A(y))\, dP(y; \theta), \qquad (3.3)$$

where $\mathbb{E}_\theta$ is the expectation w.r.t. the probability distribution $P(\cdot; \theta)$. Risk
function (3.3) describes the long-term average loss of using decision rule $A$. As an
example we may think of estimating $\gamma(\theta)$ for unknown (true) parameter $\theta$ by a
decision rule $Y \mapsto A(Y)$. Then, the loss function $L(\theta, A(Y))$ should describe the
estimation loss if we consider the discrepancy between $\gamma(\theta)$ and its estimate $A(Y)$,
and the risk function $R(\theta, A)$ is the average estimation loss in that case.

Good decision rules $A$ should provide a small risk $R(\theta, A)$. Unfortunately, this
statement is of rather theoretical nature because, in general, the true data generating
parameter $\theta$ is not known and the goodness of a decision rule for the true parameter
cannot be evaluated explicitly, but the risk can only be estimated (for instance, using
a bootstrap approach). Moreover, typically, there does not exist a uniformly best
decision rule $A$ over all $\theta \in \Theta$. For these reasons we may (just) try to eliminate
decision rules that are obviously not good. We give two introductory examples.


Example 3.2 (Minimax Decision Rule) Decision rule $A$ is called minimax if for all
alternative decision rules $\tilde{A}: \mathcal{Y} \to \mathbb{A}$ we have

$$\sup_{\theta \in \Theta} R(\theta, A) \le \sup_{\theta \in \Theta} R(\theta, \tilde{A}).$$

A minimax decision rule is the best choice in the worst case of the true $\theta$, i.e., it
minimizes the worst case risk. ■


Example 3.3 (Bayesian Decision Rule) Assume we are given a distribution $\pi$ on
$\Theta$. Decision rule $A$ is called Bayesian w.r.t. $\pi$ if it satisfies

$$A = \underset{\tilde{A}}{\arg\min} \int_\Theta R(\theta, \tilde{A})\, d\pi(\theta).$$

Distribution $\pi$ is called prior distribution on $\Theta$. ■


The above examples give two possible choices of decision rules. The first one
tries to minimize the worst case risk, whereas the second one uses additional knowledge
in terms of a prior distribution $\pi$ on $\Theta$. This means that we impose stronger
assumptions in the second case to get stronger conclusions. The difficult part in
practice is to justify these stronger assumptions in order to validate the stronger
conclusions. Below, we are going to introduce other criteria that should be satisfied
by good decision rules; an important one in estimation will be unbiasedness.
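
The risk function (3.3) and the minimax criterion of Example 3.2 can be illustrated in a simple setting that is not taken from the text: a binomial observation $X \sim \mathrm{Bin}(n, \theta)$ with square loss. The sketch compares the sample frequency $X/n$ with the rule $(X + \sqrt{n}/2)/(n + \sqrt{n})$, which is known from the general literature to have constant risk and to be minimax under square loss in this binomial problem; the function names are hypothetical, and the supremum over $\theta$ is only approximated on a grid.

```python
import math


def risk(theta, rule, n):
    """Risk R(theta, A) = E_theta[(A(X) - theta)^2] for X ~ Bin(n, theta)."""
    return sum(
        math.comb(n, x) * theta**x * (1.0 - theta) ** (n - x) * (rule(x) - theta) ** 2
        for x in range(n + 1)
    )


def worst_case_risk(rule, n, grid_size=1001):
    """Grid approximation of the supremum of the risk over theta in [0, 1]."""
    return max(risk(i / (grid_size - 1), rule, n) for i in range(grid_size))


n = 4
sample_frequency = lambda x: x / n
shrunk_rule = lambda x: (x + math.sqrt(n) / 2.0) / (n + math.sqrt(n))

print(worst_case_risk(sample_frequency, n), worst_case_risk(shrunk_rule, n))
```

The sample frequency has a smaller risk than the shrunk rule for $\theta$ close to 0 or 1, but a larger worst case risk (attained at $\theta = 1/2$); this is the trade-off that the minimax criterion resolves in favor of the second rule.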


3.2 Parameter Estimation 


This section focuses on estimating the (unknown) parameter $\theta \in \Theta$ from observation
$Y \sim P(\cdot; \theta)$. For this we consider decision rules $A: \mathcal{Y} \to \mathbb{A} = \Theta$ with $A(Y)$
estimating $\theta$. We assume there exist densities $p(\cdot; \theta)$ w.r.t. a fixed σ-finite measure
$\nu$ on $\mathcal{Y} \subset \mathbb{R}^n$,

$$dP(y; \theta) = p(y; \theta)\, d\nu(y),$$

for all distributions $P(\cdot; \theta) \in \mathcal{P}$, i.e., all $\theta \in \Theta$.


Definition 3.4 (Maximum Likelihood Estimator, MLE) The maximum
likelihood estimator (MLE) of $\theta$ for a given observation $Y \in \mathcal{Y}$ is given by
(subject to existence and uniqueness)

$$\hat{\theta}^{\mathrm{MLE}} = \underset{\tilde{\theta} \in \Theta}{\arg\max}\ p(Y; \tilde{\theta}) = \underset{\tilde{\theta} \in \Theta}{\arg\max}\ \ell_Y(\tilde{\theta}),$$

where the log-likelihood function of $p(Y; \theta)$ is defined by $\theta \mapsto \ell_Y(\theta) = \log p(Y; \theta)$.

The MLE $Y \mapsto \hat{\theta}^{\mathrm{MLE}} = \hat{\theta}^{\mathrm{MLE}}(Y) = A(Y)$ is nothing else than a specific
decision rule with action space $\mathbb{A} = \Theta$ for estimating $\theta$. We can now start to explore
the risk function $R(\theta, \hat{\theta}^{\mathrm{MLE}})$ of that decision rule for a given loss function $L$.


Example 3.5 (MLE within the EDF) We emphasize that this example is used
throughout these notes. Assume that the (independent) components of
$Y = (Y_1, \ldots, Y_n)^\top \sim P(\cdot; \theta)$ follow a given EDF distribution. That is, we assume that
$Y_1, \ldots, Y_n$ are independent and have densities w.r.t. σ-finite measures on $\mathbb{R}$ given
by, see (2.14),

$$Y_i \sim f(y_i; \theta, v_i/\varphi) = \exp\left\{ \frac{y_i \theta - \kappa(\theta)}{\varphi/v_i} + a(y_i; v_i/\varphi) \right\},$$

for $1 \le i \le n$. Note that these random variables are not i.i.d. because they may
differ in exposures $v_i > 0$. Throughout, we assume that Assumption 2.6 is fulfilled
and that the cumulant function $\kappa$ is steep, see Theorem 2.19. For the latter we also
refer to Remark 2.20: the supports $\mathfrak{T}_{v_i/\varphi}$ of $Y_i$ may differ; however, these supports
share the same convex closure.

Independence between the $Y_i$'s implies that the joint probability $P(\cdot; \theta)$ is the
product distribution of the individual distributions $F(\cdot; \theta, v_i/\varphi)$, $1 \le i \le n$.
Therefore, the MLE of $\theta$ in the EDF is found by solving

$$\hat{\theta}^{\mathrm{MLE}} = \underset{\tilde{\theta} \in \Theta}{\arg\max}\ \ell_Y(\tilde{\theta}) = \underset{\tilde{\theta} \in \Theta}{\arg\max} \sum_{i=1}^n \frac{Y_i \tilde{\theta} - \kappa(\tilde{\theta})}{\varphi/v_i}.$$

Since the cumulant function $\kappa$ is strictly convex we obtain the MLE (subject
to existence)

$$\hat{\theta}^{\mathrm{MLE}} = \hat{\theta}^{\mathrm{MLE}}(Y) = (\kappa')^{-1}\left( \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i} \right) = h\left( \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i} \right).$$

Thus, the MLE is obtained by applying the canonical link $h = (\kappa')^{-1}$, see
Definition 2.8, and strict convexity of $\kappa$ implies that the MLE is unique. However,
existence needs to be analyzed more carefully! It may happen that the MLE $\hat{\theta}^{\mathrm{MLE}}$ is
a boundary point of the effective domain $\Theta$ which may not exist (if $\Theta$ is open). We
give an example. Assume we work in the Poisson model presented in Sect. 2.1.2.
The canonical link in the Poisson model is the log-link $\mu \mapsto h(\mu) = \log(\mu)$, for
$\mu > 0$. With positive probability we have in the Poisson case $\sum_{i=1}^n v_i Y_i = 0$.
Therefore, with positive probability the MLE $\hat{\theta}^{\mathrm{MLE}}$ does not exist (we have a
degenerate Poisson model in that case).
Since the canonical link is strictly increasing we can also perform MLE in the 
dual (mean) parametrization. The dual parameter space is given by M = x’(O), 
see Remarks 2.9, with mean parameters u = «'(0) € M. This motivates 


L Yih) KhA 
aE = arg max ly (h) = argmax yy OL (3.4) 
jieM jieM p/vi 


fail 


Subject to existence, this provides the unique MLE 


n 
i=) Vi Yi 
MLE MLE (y) dizi iti 


= : 3.5 
u Tat (3.5) 


Also this dual MLE does not need to exist (in the dual parameter space $\mathcal{M}$).
Under the assumption that the cumulant function $\kappa$ is steep, we know that the closure
of the dual parameter space $\overline{\mathcal{M}}$ contains the supports $\mathfrak{T}_{v_i/\varphi}$ of $Y_i$, see Theorem 2.19
and Remark 2.20. Thus, in that case we can close the dual parameter space and
receive MLE $\hat\mu^{\rm MLE} \in \overline{\mathcal{M}}$ (in a possibly degenerate model). In the aforementioned
degenerate Poisson situation we receive $\hat\mu^{\rm MLE} = 0$ which is in the boundary $\partial\mathcal{M}$ of
the dual parameter space. ■
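
The closed form (3.5) and the existence issue on the canonical scale can be illustrated with a minimal Python sketch. The exposures and claim counts are made-up numbers and the helper names `edf_mle_mean` and `poisson_canonical_mle` are hypothetical; the sketch assumes the Poisson model with log-link, where $\hat\theta^{\rm MLE} = \log(\hat\mu^{\rm MLE})$ only exists if the weighted average is strictly positive.

```python
import math

def edf_mle_mean(y, v):
    """Weighted average (3.5): MLE of the mean parameter mu."""
    return sum(vi * yi for vi, yi in zip(v, y)) / sum(v)

def poisson_canonical_mle(y, v):
    """Apply the log-link h(mu) = log(mu); None signals the boundary case mu_hat = 0."""
    mu_hat = edf_mle_mean(y, v)
    return math.log(mu_hat) if mu_hat > 0 else None

# Claim frequencies Y_i = N_i / v_i with exposures v_i (illustrative numbers).
v = [1.0, 2.0, 0.5, 1.5]
n_claims = [0, 1, 0, 1]
y = [n / w for n, w in zip(n_claims, v)]

mu_hat = edf_mle_mean(y, v)                        # 2 claims on 5 exposure units
theta_hat = poisson_canonical_mle(y, v)            # log(mu_hat)
degenerate = poisson_canonical_mle([0.0] * 4, v)   # no claims at all: no MLE of theta
```

In the claim-free portfolio the mean-scale estimate is still well defined (it equals 0, a boundary point of the closed dual parameter space), whereas the canonical-scale estimate is not.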


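Definition 3.6 (Bayesian Estimator) The Bayesian estimator of $\theta$ for a given
observation $\boldsymbol{Y} \in \mathcal{Y}$ and a given prior distribution $\pi$ on $\boldsymbol{\Theta}$ is given by (subject to
existence)

$$\hat\theta^{\rm Bayes} = \hat\theta^{\rm Bayes}(\boldsymbol{Y}) = \mathbb{E}_\pi[\Theta \,|\, \boldsymbol{Y}],$$

where the conditional expectation on the right-hand side is calculated under the
posterior distribution $\pi(\theta|\boldsymbol{y}) \propto p(\boldsymbol{y};\theta)\,\pi(\theta)$ for a given observation $\boldsymbol{Y} = \boldsymbol{y}$.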


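Example 3.7 (Bayesian Estimator) Assume that $\mathbb{A} = \boldsymbol{\Theta} = \mathbb{R}$ and choose the square
loss function $L(\theta,a) = (\theta - a)^2$. Assume that for $\nu$-a.e. $\boldsymbol{y} \in \mathcal{Y}$ the following
decision rule $\hat{A} : \mathcal{Y} \to \mathbb{A}$ exists

$$\hat{A}(\boldsymbol{y}) = \underset{a \in \mathbb{A}}{\arg\min}\ \mathbb{E}_\pi\left[(\Theta - a)^2 \,\middle|\, \boldsymbol{Y} = \boldsymbol{y}\right], \tag{3.6}$$

where the expectation is calculated w.r.t. the posterior distribution $\pi(\theta|\boldsymbol{y})$. In
this case, $\hat{A}$ is a Bayesian decision rule w.r.t. $\pi$ and $L(\theta,a) = (\theta - a)^2$: by
assumption (3.6) we have for any other decision rule $A : \mathcal{Y} \to \mathbb{A}$, $\nu$-a.s.,

$$\mathbb{E}_\pi\left[(\Theta - \hat{A}(\boldsymbol{Y}))^2 \,\middle|\, \boldsymbol{Y} = \boldsymbol{y}\right] \le \mathbb{E}_\pi\left[(\Theta - A(\boldsymbol{Y}))^2 \,\middle|\, \boldsymbol{Y} = \boldsymbol{y}\right].$$

Applying the tower property we receive for any other decision rule $A$

$$\int_{\boldsymbol{\Theta}} R(\theta, \hat{A})\, d\pi(\theta) = \mathbb{E}\left[(\Theta - \hat{A}(\boldsymbol{Y}))^2\right] \le \mathbb{E}\left[(\Theta - A(\boldsymbol{Y}))^2\right] = \int_{\boldsymbol{\Theta}} R(\theta, A)\, d\pi(\theta),$$

where the expectation $\mathbb{E}$ is calculated over the joint distribution of $\boldsymbol{Y}$ and $\Theta$. This
proves that $\hat{A}$ is a Bayesian decision rule w.r.t. $\pi$ and $L(\theta,a) = (\theta - a)^2$, see
Example 3.3. Finally, note that the conditional expectation given in Definition 3.6 is
the minimizer of (3.6). This justifies the name Bayesian estimator in Definition 3.6
(for the square loss function). The case of the Bayesian estimator for a general loss
function $L$ is considered in Theorem 4.1.1 of Lehmann [244]. ■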
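
The statement that the posterior mean minimizes the posterior expected square loss in (3.6) can be checked numerically. The following sketch is only an illustration under assumptions that are not part of the text: a three-point prior for a Bernoulli success probability, an observation of 3 successes in 4 trials, and a grid search over actions $a$ in place of the analytical minimization.

```python
def posterior(prior, likelihood):
    """Posterior weights pi(theta | y), proportional to p(y; theta) * pi(theta)."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def expected_square_loss(thetas, post, a):
    """Posterior expectation of (Theta - a)^2 for a fixed action a."""
    return sum(w * (t - a) ** 2 for t, w in zip(thetas, post))

# Hypothetical setup: uniform prior on three candidate success probabilities.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
likelihood = [t ** 3 * (1 - t) for t in thetas]   # 3 successes in 4 trials (binomial factor cancels)
post = posterior(prior, likelihood)

# Bayesian estimator of Definition 3.6: the posterior mean.
bayes_estimate = sum(t * w for t, w in zip(thetas, post))

# Brute-force minimizer of (3.6) over a grid of actions with mesh size 0.001.
actions = [i / 1000 for i in range(1001)]
best_action = min(actions, key=lambda a: expected_square_loss(thetas, post, a))
```

Up to the grid resolution, `best_action` coincides with `bayes_estimate`, in line with the last step of Example 3.7.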


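Definition 3.8 (Method of Moments Estimator) Assume that $\boldsymbol{\Theta} \subseteq \mathbb{R}^k$ and that
the components $Y_i$ of $\boldsymbol{Y}$ are i.i.d. $F(\cdot;\boldsymbol{\theta})$ distributed with finite $k$-th moments for all
$\boldsymbol{\theta} \in \boldsymbol{\Theta}$. The law of large numbers provides, a.s., for all $1 \le l \le k$,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n Y_i^l = \mathbb{E}_{\boldsymbol{\theta}}\left[Y_1^l\right].$$

Assume that the following map is invertible (on suitable range definitions for (3.7)–(3.8))

$$\boldsymbol{\gamma} : \boldsymbol{\Theta} \to \mathbb{R}^k, \qquad \boldsymbol{\theta} \mapsto \boldsymbol{\gamma}(\boldsymbol{\theta}) = \left(\mathbb{E}_{\boldsymbol{\theta}}[Y_1], \ldots, \mathbb{E}_{\boldsymbol{\theta}}[Y_1^k]\right)^\top. \tag{3.7}$$

The method of moments estimator of $\boldsymbol{\theta}$ is defined by

$$\hat{\boldsymbol{\theta}}^{\rm MM} = \hat{\boldsymbol{\theta}}^{\rm MM}(\boldsymbol{Y}) = \boldsymbol{\gamma}^{-1}\left(\left(\frac{1}{n}\sum_{i=1}^n Y_i, \ldots, \frac{1}{n}\sum_{i=1}^n Y_i^k\right)^\top\right). \tag{3.8}$$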
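
As a small illustration of (3.7)–(3.8), consider $k = 2$ and a gamma distribution with shape $\alpha$ and rate $\beta$; this choice of family is an assumption made here for the example and is not taken from the definition. The moment map is $\boldsymbol{\gamma}(\alpha,\beta) = (\alpha/\beta,\ \alpha(\alpha+1)/\beta^2)^\top$, and it can be inverted in closed form. The function names and the claim sizes below are hypothetical.

```python
def sample_moments(sample):
    """First two empirical moments (1/n) sum Y_i and (1/n) sum Y_i^2."""
    n = len(sample)
    return sum(sample) / n, sum(y ** 2 for y in sample) / n

def gamma_method_of_moments(m1, m2):
    """Invert gamma(shape, rate) = (shape/rate, shape*(shape+1)/rate^2) at (m1, m2)."""
    variance = m2 - m1 ** 2
    if variance <= 0:
        raise ValueError("moments are not compatible with a gamma distribution")
    return m1 ** 2 / variance, m1 / variance   # (shape, rate)

claims = [0.8, 1.9, 0.4, 2.6, 1.3, 0.9, 3.1, 1.0]   # illustrative claim sizes
m1, m2 = sample_moments(claims)
shape_hat, rate_hat = gamma_method_of_moments(m1, m2)
```

By construction, the fitted gamma distribution reproduces the first two empirical moments of the sample.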


The MLE, the Bayesian estimator and the method of moments estimator are the 
most commonly used parameter estimators. They may have additional properties 
(under certain assumptions) that we are going to explore below. In the remainder of 
this section we give an additional view on estimators which is based on the empirical 
distribution of the observation $\boldsymbol{Y}$.

Assume that the components $Y_i$ of $\boldsymbol{Y}$ are real-valued and i.i.d. $F$ distributed. The
empirical distribution induced by the observation $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top$ is given by

$$\hat{F}_n(y) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{Y_i \le y\}} \qquad \text{for } y \in \mathbb{R}, \tag{3.9}$$

we also refer to Fig. 1.2 (lhs). The Glivenko–Cantelli theorem [64, 159] tells us that
the empirical distribution $\hat{F}_n$ converges uniformly to $F$, a.s., for $n \to \infty$.
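
The empirical distribution (3.9) is a step function that is straightforward to evaluate. A minimal sketch (the function name `empirical_cdf` and the sample are hypothetical) sorts the observations once and counts, by binary search, how many of them are less than or equal to $y$:

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return the map y -> F_n(y) = (1/n) * #{i : Y_i <= y}, see (3.9)."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda y: bisect_right(ordered, y) / n

F_n = empirical_cdf([2.0, 0.5, 1.0, 1.0])
values = [F_n(0.0), F_n(0.5), F_n(1.0), F_n(3.0)]
```

Ties are handled by `bisect_right`, so that a repeated observation contributes a jump of size (number of ties)$/n$ at that point.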


Definition 3.9 (Fisher-Consistency) Denote by $\mathfrak{P}$ the set of all distribution
functions on the given probability space. Let $Q : \mathfrak{P} \to \boldsymbol{\Theta}$ be a functional with
the property

$$Q(F(\cdot;\theta)) = \theta \qquad \text{for all } F(\cdot;\theta) \in \mathcal{F} = \{F(\cdot;\theta);\ \theta \in \boldsymbol{\Theta}\} \subset \mathfrak{P}.$$

Such a functional is called Fisher-consistent for $\mathcal{F}$ and $\theta \in \boldsymbol{\Theta}$, respectively.


A given Fisher-consistent functional $Q$ motivates the estimator $\hat\theta = Q(\hat{F}_n) \in \boldsymbol{\Theta}$.
This is exactly what we have applied for the method of moments estimator (3.8)
with Fisher-consistent functional induced by the inverse of (3.7). The next example
shows that this also works for MLE.


Example 3.10 (MLE and Kullback–Leibler (KL) Divergence) The MLE can be
received from a Fisher-consistent functional. Consider for $F \in \mathfrak{P}$ the functional

$$Q(F) = \underset{\tilde\theta}{\arg\max} \int \log f(y;\tilde\theta)\, dF(y),$$

assuming that $f(\cdot;\tilde\theta)$ are densities w.r.t. a $\sigma$-finite measure on $\mathbb{R}$. Assume that $F$
has density $f$ w.r.t. the $\sigma$-finite measure $\nu$ on $\mathbb{R}$. Then, we can rewrite the above as

$$Q(F) = \underset{\tilde\theta}{\arg\min} \int \log\left(\frac{f(y)}{f(y;\tilde\theta)}\right) f(y)\, d\nu(y) = \underset{\tilde\theta}{\arg\min}\ D_{\rm KL}\left(f \,\middle\|\, f(\cdot;\tilde\theta)\right).$$

The latter is the Kullback–Leibler (KL) divergence which we have met in Sect. 2.3.
Lemma 2.21 states that the KL divergence is non-negative, and it is zero if and only
if the two densities $f$ and $f(\cdot;\tilde\theta)$ are identical, $\nu$-a.s. This implies that $Q(F(\cdot;\theta)) = \theta$.
Thus, $Q$ is Fisher-consistent for $\theta \in \boldsymbol{\Theta}$, assuming identifiability, see Remarks 3.1.

Next, we use this Fisher-consistent functional (KL divergence) to receive the
MLE. Replace the unknown distribution $F$ by the empirical one to receive

$$Q(\hat{F}_n) = \underset{\tilde\theta}{\arg\min}\ D_{\rm KL}\left(\hat{f}_n \,\middle\|\, f(\cdot;\tilde\theta)\right) = \underset{\tilde\theta}{\arg\max}\ \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\tilde\theta) = \hat\theta^{\rm MLE},$$

where we have used that the empirical density $\hat{f}_n$ allocates point masses of size $1/n$
to the i.i.d. observations $Y_1, \ldots, Y_n$. Thus, the MLE $\hat\theta^{\rm MLE}$ of $\theta$ can be obtained by
choosing the model $f(\cdot;\tilde\theta)$, $\tilde\theta \in \boldsymbol{\Theta}$, that is closest in KL divergence to the empirical
distribution $\hat{F}_n$ of i.i.d. observations $Y_i \sim F$. Note that in this construction we do
not assume that the true distribution $F$ is in $\mathcal{F}$, see Definition 3.9. ■
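
The representation of the MLE as the maximizer of the average log-likelihood (equivalently, the minimizer of the KL divergence to the empirical distribution, because the entropy term of $\hat{f}_n$ does not depend on $\tilde\theta$) can be made concrete in a small sketch. The Poisson family in its mean parametrization, the sample of counts, and the grid search are assumptions for illustration only; for the Poisson family the maximizer is of course known in closed form as the sample mean.

```python
import math

def avg_log_likelihood(sample, lam):
    """(1/n) * sum_i log f(Y_i; lam) for the Poisson probability weights."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in sample) / len(sample)

sample = [0, 2, 1, 3, 1, 0, 2, 3]            # illustrative counts
grid = [k / 100 for k in range(1, 501)]      # candidate parameters in (0, 5]

# Grid maximizer of the average log-likelihood, i.e. the model closest to the
# empirical distribution in KL divergence among the grid candidates.
lam_hat = max(grid, key=lambda lam: avg_log_likelihood(sample, lam))
sample_mean = sum(sample) / len(sample)
```

Since the average log-likelihood is strictly concave in the mean parameter here, the grid maximizer agrees with the sample mean whenever the latter lies on the grid.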


Remarks 3.11


• Many properties of estimators of $\theta$ are based on properties of Fisher-consistent
functionals $Q$ (in cases where they exist). For instance, asymptotic properties as
$n \to \infty$ are obtained from smoothness properties of Fisher-consistent functionals
$Q$, or using the influence function we can analyze the impact of individual
observations $Y_i$ on decision rules $\hat\theta = \hat\theta(\boldsymbol{Y}) = Q(\hat{F}_n)$. The latter is the basis of
robust statistics, see Huber [194] and Hampel et al. [180]. Since Fisher-consistent
functionals do not require that the true distribution belongs to $\mathcal{F}$ it requires a
careful consideration of the quantity to be estimated.

• The discussion on parameter estimation has implicitly assumed that the true data
generating model belongs to the family $\mathcal{P} = \{P(\cdot;\theta);\ \theta \in \boldsymbol{\Theta}\}$, and the only
problem was to find the true parameter in $\boldsymbol{\Theta}$. More generally, one should also
consider model uncertainty w.r.t. the chosen family $\mathcal{P}$, i.e., the data generating
model may not belong to this family. Of course, this problem is by far more
difficult. We explore this in more detail in Sect. 11.1.4, below.


3.3 Unbiased Estimators 


We introduce the property of uniformly minimum variance unbiased (UMVU) for
decision rules in this section. This is a very attractive property in insurance pricing
because it gives a quality statement to decision rules (and to the resulting prices). At
the current stage it is not clear how unbiasedness is related, e.g., to the MLE of $\theta$.


3.3.1 Cramér–Rao Information Bound


Above we have stated some quality criteria for decision rules like the minimax
property. A crucial property in financial applications is the so-called unbiasedness
(for mean estimates) because this guarantees that the overall (price) levels are
correctly specified.

Definition 3.12 (Uniformly Minimum Variance Unbiased, UMVU) A
decision rule $A : \mathcal{Y} \to \mathbb{A} = \mathbb{R}$ is unbiased for $\gamma : \boldsymbol{\Theta} \to \mathbb{R}$ if for all
$\boldsymbol{Y} \sim P(\cdot;\theta)$, $\theta \in \boldsymbol{\Theta}$, we have

$$\mathbb{E}_\theta[A(\boldsymbol{Y})] = \gamma(\theta). \tag{3.10}$$

The decision rule $A$ is called UMVU for $\gamma$ if additionally to the unbiasedness (3.10)
we have

$$\mathrm{Var}_\theta(A(\boldsymbol{Y})) \le \mathrm{Var}_\theta(\tilde{A}(\boldsymbol{Y})),$$

for all $\theta \in \boldsymbol{\Theta}$ and for any other decision rule $\tilde{A} : \mathcal{Y} \to \mathbb{R}$ that is unbiased for
$\gamma$.


Note that unbiasedness is not invariant under transformations, i.e., if $A(\boldsymbol{Y})$ is
unbiased for $\gamma(\theta)$, then, in general, $b(A(\boldsymbol{Y}))$ is not unbiased for $b(\gamma(\theta))$. For
instance, if $b$ is strictly convex then we get a counterexample by simply applying
Jensen's inequality.

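Our first step is to derive a general lower bound for $\mathrm{Var}_\theta(A(\boldsymbol{Y}))$. If this general
lower bound is met for an unbiased decision rule $A$ for $\gamma$, then we know that it
is UMVU for $\gamma$. We start with the one-dimensional case given in Section 2.6 of
Lehmann [244].


Theorem 3.13 (Cramér–Rao Information Bound) Assume that the distributions
$P(\cdot;\theta)$, $\theta \in \boldsymbol{\Theta}$, have densities $p(\cdot;\theta)$ for a given $\sigma$-finite measure $\nu$
on $\mathcal{Y}$, and that $\boldsymbol{\Theta} \subset \mathbb{R}$ is an open interval such that the set $\{\boldsymbol{y};\ p(\boldsymbol{y};\theta) > 0\}$
does not depend on $\theta \in \boldsymbol{\Theta}$. Let $A(\boldsymbol{Y})$ be unbiased for $\gamma : \boldsymbol{\Theta} \to \mathbb{R}$ having
finite second moment. If the limit

$$\frac{\partial}{\partial\theta}\log p(\boldsymbol{y};\theta) = \lim_{\Delta \to 0} \frac{1}{\Delta}\, \frac{p(\boldsymbol{y};\theta+\Delta) - p(\boldsymbol{y};\theta)}{p(\boldsymbol{y};\theta)}$$

exists in $L^2(P(\cdot;\theta))$ and if

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right)^2\right] \in (0,\infty),$$

then the function $\theta \mapsto \gamma(\theta)$ is differentiable, $\mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right] = 0$ and we
have information bound

$$\mathrm{Var}_\theta(A(\boldsymbol{Y})) \ge \frac{\gamma'(\theta)^2}{\mathcal{I}(\theta)}.$$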




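Proof We start from an arbitrary function $\psi : \boldsymbol{\Theta} \times \mathcal{Y} \to \mathbb{R}$ with finite variance
$\mathrm{Var}_\theta(\psi(\theta,\boldsymbol{Y})) \in (0,\infty)$ for all $\theta \in \boldsymbol{\Theta}$. The Cauchy–Schwarz inequality implies

$$\mathrm{Var}_\theta(A(\boldsymbol{Y})) \ge \frac{\mathrm{Cov}_\theta\left(A(\boldsymbol{Y}), \psi(\theta,\boldsymbol{Y})\right)^2}{\mathrm{Var}_\theta(\psi(\theta,\boldsymbol{Y}))}. \tag{3.11}$$

If we manage to make the right-hand side of (3.11) independent of decision rule
$A(\cdot)$ we have a general lower bound, we also refer to Theorem 2.6.1 in Lehmann
[244].

The Cauchy–Schwarz inequality implies that for any $U \in L^2(P(\cdot;\theta))$ the
following limit exists and is equal to

$$\lim_{\Delta \to 0} \mathbb{E}_\theta\left[\frac{1}{\Delta}\, \frac{p(\boldsymbol{Y};\theta+\Delta) - p(\boldsymbol{Y};\theta)}{p(\boldsymbol{Y};\theta)}\, U\right] = \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\, U\right]. \tag{3.12}$$

Setting $U \equiv 1$ gives average score $\mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right] = 0$ because for sufficiently
small $\Delta$

$$\mathbb{E}_\theta\left[\frac{p(\boldsymbol{Y};\theta+\Delta) - p(\boldsymbol{Y};\theta)}{p(\boldsymbol{Y};\theta)}\right] = \int_{\mathcal{Y}} \frac{p(\boldsymbol{y};\theta+\Delta) - p(\boldsymbol{y};\theta)}{p(\boldsymbol{y};\theta)}\, p(\boldsymbol{y};\theta)\, d\nu(\boldsymbol{y}) = 0,$$

where we have used that the support of the random variables does not depend on $\theta$
and that the domain $\boldsymbol{\Theta}$ of $\theta$ is open.

Secondly, we set $U = A(\boldsymbol{Y})$ in (3.12). We have similarly to above using
unbiasedness w.r.t. $\gamma$

$$\mathrm{Cov}_\theta\left(A(\boldsymbol{Y}),\ \frac{p(\boldsymbol{Y};\theta+\Delta) - p(\boldsymbol{Y};\theta)}{p(\boldsymbol{Y};\theta)}\right) = \int_{\mathcal{Y}} A(\boldsymbol{y})\, \frac{p(\boldsymbol{y};\theta+\Delta) - p(\boldsymbol{y};\theta)}{p(\boldsymbol{y};\theta)}\, p(\boldsymbol{y};\theta)\, d\nu(\boldsymbol{y}) = \gamma(\theta+\Delta) - \gamma(\theta).$$

Existence of limit (3.12) provides the differentiability of $\gamma$. Finally, from (3.11) we
have

$$\mathrm{Var}_\theta(A(\boldsymbol{Y})) \ge \lim_{\Delta \to 0} \frac{\mathrm{Cov}_\theta\left(A(\boldsymbol{Y}),\ \frac{p(\boldsymbol{Y};\theta+\Delta) - p(\boldsymbol{Y};\theta)}{p(\boldsymbol{Y};\theta)}\right)^2}{\mathrm{Var}_\theta\left(\frac{p(\boldsymbol{Y};\theta+\Delta) - p(\boldsymbol{Y};\theta)}{p(\boldsymbol{Y};\theta)}\right)} = \frac{\gamma'(\theta)^2}{\mathcal{I}(\theta)}. \tag{3.13}$$

This completes the proof. □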


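Remarks 3.14 (Fisher's Information and Score)


• $\mathcal{I}(\theta)$ is called Fisher's information or Fisher metric.

• $s(\theta,\boldsymbol{Y}) = \frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)$ is called score, and $\mathbb{E}_\theta[s(\theta,\boldsymbol{Y})] = 0$ in Theorem 3.13
expresses that the average score is zero under the assumptions of that theorem.

• Under the regularity conditions of Lemma 6.1 in Section 2.6 of Lehmann [244]

$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right)^2\right] = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p(\boldsymbol{Y};\theta)\right]. \tag{3.14}$$

Fisher's information $\mathcal{I}(\theta)$ expresses the variance of the score $s(\theta,\boldsymbol{Y})$. Identity (3.14)
justifies the notion Fisher's information in Sect. 2.3 for the EF.

• In order to determine the Cramér–Rao information bound for unknown $\theta$ we
need to estimate Fisher's information $\mathcal{I}(\theta)$ from the available data. There are
two different ways to do so, either we choose

$$\mathcal{I}(\hat\theta) = \mathbb{E}_{\hat\theta}\left[\left(\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right)^2\right],$$

or we choose the observed Fisher's information

$$\hat{\mathcal{I}}(\hat\theta) = \left.\left(\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right)^2\right|_{\theta=\hat\theta},$$

for given data $\boldsymbol{Y}$ and where $\hat\theta = \hat\theta(\boldsymbol{Y})$. Both estimated Fisher's information $\mathcal{I}(\hat\theta)$
and $\hat{\mathcal{I}}(\hat\theta)$ play a central role in MLE of generalized linear models (GLMs). They
are used in Fisher's scoring method, the iterated re-weighted least squares (IRLS)
algorithm and the Newton–Raphson algorithm to determine the MLE.

• The Cramér–Rao information bound in Theorem 3.13 is stated in terms of the
observation $\boldsymbol{Y} \sim p(\cdot;\theta)$. Assume that the components $Y_i$ of $\boldsymbol{Y}$ are i.i.d. $f(\cdot;\theta)$
distributed. In this case, Fisher's information scales as

$$\mathcal{I}(\theta) = \mathcal{I}_n(\theta) = n\,\mathcal{I}_1(\theta), \tag{3.15}$$

with single risk's Fisher's information (contribution)

$$\mathcal{I}_1(\theta) = \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(Y_1;\theta)\right)^2\right].$$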


In general, Fisher's information is additive in independent random variables,
because the product of densities is additive after applying the logarithm, and
because the average score is zero.

Proposition 3.15 The unbiased decision rule $A$ for $\gamma$ attains the Cramér–Rao
information bound if and only if the density is of the form
$p(\boldsymbol{y};\theta) = \exp\{\delta(\theta)T(\boldsymbol{y}) - \beta(\theta) + a(\boldsymbol{y})\}$ with $T = A$. In that case we have
$\gamma(\theta) = \beta'(\theta)/\delta'(\theta)$.


Proof of Proposition 3.15 The Cauchy–Schwarz inequality provides equality
in (3.13) if and only if $\frac{\partial}{\partial\theta}\log p(\boldsymbol{y};\theta) = \delta'(\theta)A(\boldsymbol{y}) - \beta'(\theta)$, $\nu$-a.s., for some functions
$\delta'(\theta)$ and $\beta'(\theta)$ on $\boldsymbol{\Theta}$. Integration and the fact that $p(\cdot;\theta)$ is a density whose support
does not depend on the explicit choice of $\theta \in \boldsymbol{\Theta}$ provide the implication "$\Rightarrow$". For
the implication "$\Leftarrow$" we study for $A = T$

$$0 = \mathbb{E}_\theta\left[\frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta)\right] = \int_{\mathcal{Y}} \left(\delta'(\theta)A(\boldsymbol{y}) - \beta'(\theta)\right) p(\boldsymbol{y};\theta)\, d\nu(\boldsymbol{y}) = \delta'(\theta)\,\mathbb{E}_\theta[A(\boldsymbol{Y})] - \beta'(\theta).$$

In that case we have $\gamma(\theta) = \mathbb{E}_\theta[A(\boldsymbol{Y})] = \beta'(\theta)/\delta'(\theta)$. Moreover, we have equality
in the Cauchy–Schwarz inequality. This finishes the proof. □


The single-parameter EF fulfills the properties of Proposition 3.15 with $\delta(\theta) = \theta$
and $\beta(\theta) = \kappa(\theta)$, and decision rule $A(\boldsymbol{y}) = T(\boldsymbol{y})$ attains the Cramér–Rao
information bound for $\gamma(\theta) = \kappa'(\theta)$.
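
Identity (3.14), the scaling (3.15) and the attainment of the bound can be verified exactly in a toy case. The sketch below assumes i.i.d. Bernoulli observations parametrized by the success probability $p \in (0,1)$ (a choice made here for illustration); the probability weights $p^y(1-p)^{1-y}$ are of the form required in Proposition 3.15 with $T(y) = y$, and the sample mean is unbiased for $\gamma(p) = p$. The function name is hypothetical.

```python
def bernoulli_information(p):
    """Return (E[score^2], -E[second derivative of log f]) for one Bernoulli(p) observation."""
    pmf = {0: 1 - p, 1: p}
    score = lambda y: y / p - (1 - y) / (1 - p)                 # d/dp log f(y; p)
    second = lambda y: -y / p ** 2 - (1 - y) / (1 - p) ** 2     # d^2/dp^2 log f(y; p)
    mean_sq_score = sum(pmf[y] * score(y) ** 2 for y in pmf)
    neg_mean_second = -sum(pmf[y] * second(y) for y in pmf)
    return mean_sq_score, neg_mean_second

p, n = 0.3, 50
info_score, info_hessian = bernoulli_information(p)   # both sides of (3.14)
info_n = n * info_score                               # scaling (3.15)
cramer_rao_bound = 1.0 / info_n                       # gamma'(p) = 1 for gamma(p) = p
var_sample_mean = p * (1 - p) / n                     # variance of the sample mean
```

Both expressions for Fisher's information equal $1/(p(1-p))$, and the variance of the sample mean coincides with the Cramér–Rao information bound, so the sample mean is UMVU for $p$ in this model.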

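We give a multi-dimensional version of the Cramér–Rao information bound.


Theorem 3.16 (Multi-Dimensional Version of the Cramér–Rao
Information Bound, Without Proof) Assume that the distributions $P(\cdot;\boldsymbol{\theta})$,
$\boldsymbol{\theta} \in \boldsymbol{\Theta}$, have densities $p(\cdot;\boldsymbol{\theta})$ for a given $\sigma$-finite measure $\nu$ on $\mathcal{Y}$, and that
$\boldsymbol{\Theta} \subseteq \mathbb{R}^k$ is an open convex set such that the set $\{\boldsymbol{y};\ p(\boldsymbol{y};\boldsymbol{\theta}) > 0\}$ does not
depend on $\boldsymbol{\theta} \in \boldsymbol{\Theta}$. Let $A(\boldsymbol{Y})$ be unbiased for $\gamma : \boldsymbol{\Theta} \to \mathbb{R}$ having finite
second moment. Under additional regularity conditions, see Theorem 7.3 in
Section 2.7 of Lehmann [244], we have

$$\mathrm{Var}_{\boldsymbol{\theta}}(A(\boldsymbol{Y})) \ge \left(\nabla_{\boldsymbol{\theta}}\gamma(\boldsymbol{\theta})\right)^\top \mathcal{I}(\boldsymbol{\theta})^{-1}\, \nabla_{\boldsymbol{\theta}}\gamma(\boldsymbol{\theta}),$$

with (positive definite) Fisher's information matrix $\mathcal{I}(\boldsymbol{\theta}) = (\mathcal{I}_{l,j}(\boldsymbol{\theta}))_{1 \le l,j \le k}$
given by

$$\mathcal{I}_{l,j}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta_l}\log p(\boldsymbol{Y};\boldsymbol{\theta})\ \frac{\partial}{\partial\theta_j}\log p(\boldsymbol{Y};\boldsymbol{\theta})\right],$$

for $1 \le l,j \le k$.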




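Remarks 3.17


• Whenever an unbiased decision rule $A(\boldsymbol{Y})$ for $\gamma(\boldsymbol{\theta})$ meets the Cramér–Rao
information bound it is UMVU. Thus, it minimizes the risk function $R(\boldsymbol{\theta}, A)$
being based on the square loss $L(\boldsymbol{\theta},a) = (\gamma(\boldsymbol{\theta}) - a)^2$ among all unbiased
decision rules, because unbiasedness for $\gamma(\boldsymbol{\theta})$ gives $R(\boldsymbol{\theta}, A) = \mathrm{Var}_{\boldsymbol{\theta}}(A(\boldsymbol{Y}))$.

• The regularity conditions in Theorem 3.16 include that Fisher's information
matrix $\mathcal{I}(\boldsymbol{\theta})$ is positive definite.

• Under additional regularity conditions we have the following identity for Fisher's
information matrix

$$\mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[\left(\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{Y};\boldsymbol{\theta})\right)\left(\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{Y};\boldsymbol{\theta})\right)^\top\right] = -\mathbb{E}_{\boldsymbol{\theta}}\left[\nabla^2_{\boldsymbol{\theta}}\log p(\boldsymbol{Y};\boldsymbol{\theta})\right] \in \mathbb{R}^{k \times k}.$$

Thus, Fisher's information matrix can either be calculated from a quadratic
form of the score $s(\boldsymbol{\theta},\boldsymbol{Y}) = \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{Y};\boldsymbol{\theta})$ or from the Hessian $\nabla^2_{\boldsymbol{\theta}}$ of the
log-likelihood $\ell_{\boldsymbol{Y}}(\boldsymbol{\theta}) = \log p(\boldsymbol{Y};\boldsymbol{\theta})$. Since the score has mean zero, Fisher's
information matrix is equal to the covariance matrix of the score $s(\boldsymbol{\theta},\boldsymbol{Y})$.


In many situations we do not work under the canonical parametrization $\boldsymbol{\theta}$.
Considerations then require a change of variable. Assume that

$$\boldsymbol{\zeta} \in \mathbb{R}^r \mapsto \boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\zeta}) \in \mathbb{R}^k,$$

such that all derivatives $\partial\theta_l(\boldsymbol{\zeta})/\partial\zeta_j$ exist for $1 \le l \le k$ and $1 \le j \le r$. The Jacobian
matrix is given by

$$J(\boldsymbol{\zeta}) = \left(\frac{\partial}{\partial\zeta_j}\theta_l(\boldsymbol{\zeta})\right)_{1 \le l \le k,\ 1 \le j \le r} \in \mathbb{R}^{k \times r}.$$

Fisher's information matrix w.r.t. $\boldsymbol{\zeta}$ is given by

$$\mathcal{I}^*(\boldsymbol{\zeta}) = \left(\mathbb{E}_{\boldsymbol{\theta}(\boldsymbol{\zeta})}\left[\frac{\partial}{\partial\zeta_l}\log p(\boldsymbol{Y};\boldsymbol{\theta}(\boldsymbol{\zeta}))\ \frac{\partial}{\partial\zeta_j}\log p(\boldsymbol{Y};\boldsymbol{\theta}(\boldsymbol{\zeta}))\right]\right)_{1 \le l,j \le r} \in \mathbb{R}^{r \times r},$$

and we have the identity

$$\mathcal{I}^*(\boldsymbol{\zeta}) = J(\boldsymbol{\zeta})^\top\, \mathcal{I}(\boldsymbol{\theta}(\boldsymbol{\zeta}))\, J(\boldsymbol{\zeta}). \tag{3.16}$$

This formula is used quite frequently, e.g., in generalized linear models when
changing the parametrization of the models.


3.3.2 Information Bound in the Exponential Family Case


The purpose of this section is to summarize the Cramér–Rao information bound
results for the EF and the EDF, since these families play a distinguished role in
statistical and actuarial modeling.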


Cramér—Rao Information Bound in the EF Case 


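We start with the EF case. Assume we have i.i.d. observations $Y_1, \ldots, Y_n$ having
densities w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$ given by the EF, see (2.2),

$$dF(y;\boldsymbol{\theta}) = f(y;\boldsymbol{\theta})\, d\nu(y) = \exp\left\{\boldsymbol{\theta}^\top T(y) - \kappa(\boldsymbol{\theta}) + a(y)\right\} d\nu(y),$$

for canonical parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta} \subseteq \mathbb{R}^k$. We assume to work under a minimal
representation implying that the cumulant function $\kappa$ is strictly convex on the
interior $\mathring{\boldsymbol{\Theta}}$, see Assumption 2.6. Moreover, we assume that the cumulant function
$\kappa$ is steep in the sense of Theorem 2.19. Consider the (aggregated) statistics of the
joint EF $\mathcal{P} = \{P(\cdot;\boldsymbol{\theta});\ \boldsymbol{\theta} \in \boldsymbol{\Theta}\}$

$$\boldsymbol{y} \mapsto S(\boldsymbol{y}) = \left(\sum_{i=1}^n T_1(y_i), \ldots, \sum_{i=1}^n T_k(y_i)\right)^\top \in \mathbb{R}^k. \tag{3.17}$$

We calculate the score of this EF

$$s(\boldsymbol{\theta},\boldsymbol{Y}) = \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{Y};\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}}\left(\boldsymbol{\theta}^\top \sum_{i=1}^n T(Y_i) - n\kappa(\boldsymbol{\theta})\right) = S(\boldsymbol{Y}) - n\nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}).$$

An immediate consequence of Corollary 2.5 is that the expected value of the score
is zero for any $\boldsymbol{\theta} \in \boldsymbol{\Theta}$. This then reads as

$$\boldsymbol{\mu} = \mathbb{E}_{\boldsymbol{\theta}}[T(Y_1)] = \mathbb{E}_{\boldsymbol{\theta}}[S(\boldsymbol{Y})/n] = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}) \in \mathbb{R}^k. \tag{3.18}$$

Thus, the statistics $S(\boldsymbol{Y})/n$ is an unbiased decision rule for the mean $\boldsymbol{\mu} = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta})$,
and we can study its Cramér–Rao information bound. Fisher's information matrix
is given by the positive definite matrix

$$\mathcal{I}(\boldsymbol{\theta}) = \mathcal{I}_n(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}\left[s(\boldsymbol{\theta},\boldsymbol{Y})\, s(\boldsymbol{\theta},\boldsymbol{Y})^\top\right] = -\mathbb{E}_{\boldsymbol{\theta}}\left[\nabla^2_{\boldsymbol{\theta}}\log p(\boldsymbol{Y};\boldsymbol{\theta})\right] = n\nabla^2_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}) \in \mathbb{R}^{k \times k}.$$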


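Note that the multi-dimensionally extended Cramér–Rao information bound in
Theorem 3.16 applies to the individual components of vector $\boldsymbol{\mu} = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}) \in \mathbb{R}^k$.
Assume we would like to estimate its $j$-th component, set
$\gamma_j(\boldsymbol{\theta}) = \mu_j = (\nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}))_j = \partial\kappa(\boldsymbol{\theta})/\partial\theta_j$, for $1 \le j \le k$. This corresponds to the $j$-th component
$S_j(\boldsymbol{Y})$ of the statistics $S(\boldsymbol{Y})$. We have unbiasedness of $S_j(\boldsymbol{Y})/n$ for
$\gamma_j(\boldsymbol{\theta}) = \mu_j = (\nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}))_j$, and this unbiased statistics attains the Cramér–Rao information bound

$$\mathrm{Var}_{\boldsymbol{\theta}}\left(S_j(\boldsymbol{Y})/n\right) = \frac{1}{n}\left(\nabla^2_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta})\right)_{j,j} = \left(\nabla_{\boldsymbol{\theta}}\gamma_j(\boldsymbol{\theta})\right)^\top \mathcal{I}(\boldsymbol{\theta})^{-1}\, \nabla_{\boldsymbol{\theta}}\gamma_j(\boldsymbol{\theta}). \tag{3.19}$$

Recall that $\mathcal{I}(\boldsymbol{\theta})^{-1}$ scales as $n^{-1}$, see (3.15). This provides us with the following
corollary.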


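Corollary 3.18 Assume $Y_1, \ldots, Y_n$ are i.i.d. and follow an EF (under a
minimal representation). The components of the statistics $S(\boldsymbol{Y})/n$ are UMVU
for $\gamma_j(\boldsymbol{\theta}) = \partial\kappa(\boldsymbol{\theta})/\partial\theta_j$, $1 \le j \le k$ and $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, with

$$\mathrm{Var}_{\boldsymbol{\theta}}\left(\frac{1}{n}S_j(\boldsymbol{Y})\right) = \frac{1}{n}\frac{\partial^2}{\partial\theta_j^2}\kappa(\boldsymbol{\theta}).$$

The corresponding covariance terms are for $1 \le j,l \le k$ given by

$$\mathrm{Cov}_{\boldsymbol{\theta}}\left(\frac{1}{n}S_j(\boldsymbol{Y}),\ \frac{1}{n}S_l(\boldsymbol{Y})\right) = \frac{1}{n}\frac{\partial^2}{\partial\theta_j\,\partial\theta_l}\kappa(\boldsymbol{\theta}).$$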


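The UMVU property stated in Corollary 3.18 is, in general, not related to MLE,
but within the EF there is the following link. We have (subject to existence)

$$\hat{\boldsymbol{\theta}}^{\rm MLE} = \underset{\tilde{\boldsymbol{\theta}} \in \boldsymbol{\Theta}}{\arg\max}\ p(\boldsymbol{Y};\tilde{\boldsymbol{\theta}}) = \underset{\tilde{\boldsymbol{\theta}} \in \boldsymbol{\Theta}}{\arg\max}\left(\tilde{\boldsymbol{\theta}}^\top S(\boldsymbol{Y}) - n\kappa(\tilde{\boldsymbol{\theta}})\right) = h\left(\frac{1}{n}S(\boldsymbol{Y})\right), \tag{3.20}$$

where $h = (\nabla_{\boldsymbol{\theta}}\kappa)^{-1}$ is the canonical link of this EF, see Definition 2.8; and where
we need to ensure that a solution to (3.20) exists; e.g., the solution to (3.20) might
be at the boundary of $\boldsymbol{\Theta}$ which may cause problems, see Example 3.5.¹ Because the
cumulant function $\kappa$ is strictly convex (in a minimal representation), we receive the
MLE for the mean parameter $\boldsymbol{\mu} = \mathbb{E}_{\boldsymbol{\theta}}[T(Y_1)]$

$$\hat{\boldsymbol{\mu}}^{\rm MLE} = \underset{\tilde{\boldsymbol{\mu}} \in \mathcal{M}}{\arg\max}\left(h(\tilde{\boldsymbol{\mu}})^\top S(\boldsymbol{Y}) - n\kappa(h(\tilde{\boldsymbol{\mu}}))\right) = \frac{1}{n}S(\boldsymbol{Y}),$$

the dual parameter space $\mathcal{M} = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\Theta}) \subseteq \mathbb{R}^k$ has been introduced in Remarks 2.9.
If $S(\boldsymbol{Y})/n$ is contained in $\mathcal{M}$, then this MLE is a proper solution; otherwise, because
we have assumed that the cumulant function $\kappa$ is steep, the MLE exists in the closure
$\overline{\mathcal{M}}$, see Theorem 2.19, and it is UMVU for $\boldsymbol{\mu}$, see Corollary 3.18.

¹ Another example where there does not exist a proper solution to the MLE problem (3.20) is, for
instance, obtained within the 2-dimensional Gaussian EF if we have only one single observation $Y_1$.
Intuitively this is clear because we cannot estimate two parameters from one observation
$T(Y_1) = (Y_1, Y_1^2)$.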


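Corollary 3.19 (Balance Property) Assume $Y_1, \ldots, Y_n$ are i.i.d. and follow
an EF with $\boldsymbol{\theta} \in \boldsymbol{\Theta}$ and $T(Y_i) \in \overline{\mathcal{M}}$, a.s. The MLE $\hat{\boldsymbol{\mu}}^{\rm MLE} \in \overline{\mathcal{M}}$ is UMVU for
$\boldsymbol{\mu}$, and it fulfills the balance property on portfolio level, i.e.,

$$\sum_{i=1}^n \mathbb{E}_{\hat{\boldsymbol{\mu}}^{\rm MLE}}[T(Y_i)] = n\,\hat{\boldsymbol{\mu}}^{\rm MLE} = S(\boldsymbol{Y}).$$


Remarks 3.20


• The balance property is a very important property in insurance pricing because it
implies that the portfolio is priced on the right level: we have unbiasedness

$$\mathbb{E}_{\boldsymbol{\theta}}\left[\sum_{i=1}^n \mathbb{E}_{\hat{\boldsymbol{\mu}}^{\rm MLE}}[T(Y_i)]\right] = \mathbb{E}_{\boldsymbol{\theta}}[S(\boldsymbol{Y})] = n\boldsymbol{\mu}. \tag{3.21}$$

• We emphasize that the balance property is much stronger than unbiasedness (3.21),
note that the balance property provides unbiasedness even if $\boldsymbol{Y}$
follows a completely different model, i.e., even if the chosen EF $\mathcal{P}$ is completely
misspecified.

• In general, the MLE $\hat{\boldsymbol{\theta}}^{\rm MLE}$ is not unbiased for $\boldsymbol{\theta}$. E.g., if the canonical link
$h = (\nabla_{\boldsymbol{\theta}}\kappa)^{-1}$ is strictly concave, we have from Jensen's inequality, subject to
existence at the boundary of $\boldsymbol{\Theta}$,

$$\mathbb{E}_{\boldsymbol{\theta}}\left[\hat{\boldsymbol{\theta}}^{\rm MLE}\right] = \mathbb{E}_{\boldsymbol{\theta}}\left[h\left(\frac{1}{n}S(\boldsymbol{Y})\right)\right] < h\left(\mathbb{E}_{\boldsymbol{\theta}}\left[\frac{1}{n}S(\boldsymbol{Y})\right]\right) = h(\boldsymbol{\mu}) = \boldsymbol{\theta}. \tag{3.22}$$

• The statistics $S(\boldsymbol{Y})$ is a sufficient statistics of $\boldsymbol{Y}$, this follows from the factorization
criterion; see Theorem 1.5.2 of Lehmann [244].


Cramér–Rao Information Bound in the EDF Case


The single-parameter linear EDF case is very similar to the above vector-valued
parameter EF case. We briefly summarize the main results in the EDF case.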

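Recall Example 3.5: assume that $Y_1, \ldots, Y_n$ are independent having densities
w.r.t. $\sigma$-finite measures on $\mathbb{R}$ (not being concentrated in a single point) given by,
see (2.14),

$$Y_i \sim f(y_i;\theta,v_i/\varphi) = \exp\left\{\frac{y_i\theta - \kappa(\theta)}{\varphi/v_i} + a(y_i;v_i/\varphi)\right\}, \tag{3.23}$$

for $1 \le i \le n$. Note that these random variables are not i.i.d. because they may differ
in the exposures $v_i > 0$. The MLE of $\mu = \kappa'(\theta)$, $\theta \in \boldsymbol{\Theta}$, is found by, see (3.5),

$$\hat\mu^{\rm MLE} = \underset{\tilde\mu \in \mathcal{M}}{\arg\max}\ \sum_{i=1}^n \frac{Y_i h(\tilde\mu) - \kappa(h(\tilde\mu))}{\varphi/v_i} = \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i}, \tag{3.24}$$

we assume that $\kappa$ is steep to ensure $\hat\mu^{\rm MLE} \in \overline{\mathcal{M}}$. The convolution formula of
Corollary 2.15 says that the MLE $\hat\mu^{\rm MLE} = Y_+$ belongs to the same EDF with the
same canonical parameter $\theta$ and the same dispersion $\varphi$, only the weight changes to
$v_+ = \sum_{i=1}^n v_i$.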


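Corollary 3.21 (Balance Property) Assume $Y_1, \ldots, Y_n$ are independent with
EDF distribution (3.23) for $\theta \in \boldsymbol{\Theta}$ and $Y_i \in \overline{\mathcal{M}}$, a.s. The MLE $\hat\mu^{\rm MLE} \in \overline{\mathcal{M}}$ is
UMVU for $\mu = \kappa'(\theta)$, and it fulfills the balance property on portfolio level,
i.e.,

$$\sum_{i=1}^n \mathbb{E}_{\hat\mu^{\rm MLE}}[v_i Y_i] = \sum_{i=1}^n v_i\, \hat\mu^{\rm MLE} = \sum_{i=1}^n v_i Y_i.$$


The score in this EDF is given by

$$s(\theta,\boldsymbol{Y}) = \frac{\partial}{\partial\theta}\log p(\boldsymbol{Y};\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\, \frac{v_i}{\varphi}\left(\theta Y_i - \kappa(\theta)\right) = \sum_{i=1}^n \frac{v_i}{\varphi}\left(Y_i - \kappa'(\theta)\right).$$

Of course, we have $\mathbb{E}_\theta[s(\theta,\boldsymbol{Y})] = 0$ and we receive Fisher's information for $\theta \in \mathring{\boldsymbol{\Theta}}$

$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p(\boldsymbol{Y};\theta)\right] = \sum_{i=1}^n \frac{v_i}{\varphi}\,\kappa''(\theta) > 0. \tag{3.25}$$

Corollary 2.15 gives for the variance of the MLE

$$\mathrm{Var}_\theta\left(\hat\mu^{\rm MLE}\right) = \frac{\varphi}{\sum_{i=1}^n v_i}\,\kappa''(\theta) = \frac{(\kappa''(\theta))^2}{\mathcal{I}(\theta)} = \frac{(\partial\mu(\theta)/\partial\theta)^2}{\mathcal{I}(\theta)}.$$

This verifies that $\hat\mu^{\rm MLE}$ meets the Cramér–Rao information bound and is UMVU
for the mean $\mu = \kappa'(\theta)$.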


Example 3.22 (Poisson Case) For this example, we consider independent Poisson
random variables $N_i \sim \mathrm{Poi}(v_i\lambda)$. In Sect. 2.2.2 we have seen that $Y_i = N_i/v_i$ can
be modeled within the single-parameter linear EDF framework using as cumulant
function the exponential function $\kappa(\theta) = e^\theta$, and setting $w_i = v_i$ and $\varphi = 1$. Thus,
the probability weights of a single observation $Y_i$ are given by, see (2.15),

$$f(y_i;\theta,v_i) = \exp\left\{v_i\left(\theta y_i - e^\theta\right) + a(y_i;v_i)\right\},$$

with canonical parameter $\theta = \log(\lambda) \in \boldsymbol{\Theta} = \mathbb{R}$. The MLE in the mean
parametrization is given by, see (3.24),

$$\hat\lambda^{\rm MLE} = \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i} = \frac{\sum_{i=1}^n N_i}{\sum_{i=1}^n v_i} \in \overline{\mathcal{M}} = [0,\infty).$$

This estimator is unbiased for $\lambda$. Having independent Poisson random variables we
can calculate the variance of this estimator as

$$\mathrm{Var}\left(\hat\lambda^{\rm MLE}\right) = \frac{\lambda}{\sum_{i=1}^n v_i}.$$

Moreover, from Corollary 3.21 we know that this estimator is UMVU for $\lambda$, which
can easily be seen, and uses Fisher's information (3.25) with dispersion parameter
$\varphi = 1$

$$\mathcal{I}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p(\boldsymbol{Y};\theta)\right] = \sum_{i=1}^n v_i e^\theta = \lambda\sum_{i=1}^n v_i.$$
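
The quantities of the Poisson example can be put side by side in a short sketch. The exposures, the claim counts and the assumed true frequency `lam_true` are illustrative inputs, and the function names are hypothetical. The sketch checks the balance property of Corollary 3.21 and compares the variance of the MLE with the Cramér–Rao information bound $\gamma'(\theta)^2/\mathcal{I}(\theta)$ for $\gamma(\theta) = e^\theta = \lambda$.

```python
def poisson_exposure_mle(n_claims, v):
    """MLE of the claim frequency lambda: total claims over total exposure."""
    return sum(n_claims) / sum(v)

def fisher_information_theta(lam, v):
    """Fisher's information (3.25) for theta = log(lambda) and phi = 1."""
    return sum(v) * lam

v = [0.5, 1.0, 2.0, 1.5]      # exposures (illustrative)
n_claims = [0, 1, 0, 2]       # observed claim counts (illustrative)
lam_hat = poisson_exposure_mle(n_claims, v)

# Balance property: the fitted expected number of claims equals the observed number.
fitted_total = sum(vi * lam_hat for vi in v)

# Cramer-Rao bound at an assumed true frequency, with gamma'(theta) = exp(theta) = lambda.
lam_true = 0.4
bound = lam_true ** 2 / fisher_information_theta(lam_true, v)
variance = lam_true / sum(v)  # variance of the MLE from the example
```

The bound and the variance coincide for every choice of `lam_true`, which is the UMVU statement of the example.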


One could study many other properties of decision rules (and corresponding 
estimators), for instance, admissibility or uniformly minimum risk equivariance 
(UMRE), and we could also study other families of distribution functions such as 
group families. We refrain from doing so because we will not need this for our 
purposes.

3.4 Asymptotic Behavior of Estimators


All results above have been based on a finite sample $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$, we add a
lower index $n$ to $\boldsymbol{Y}_n$ to indicate the finite sample size $n \in \mathbb{N}$. The aim of this section
is to analyze properties of decision rules when the sample size $n$ tends to infinity.


3.4.1 Consistency


Assume we have an infinite sequence of observations $Y_i$, $i \ge 1$, which allows us
to construct an infinite sequence of decision rules $A_n = A_n(\boldsymbol{Y}_n)$, $n \ge 1$, where
$A_n$ always considers the first $n$ observations $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top \sim P_n(\cdot;\theta)$, for
$\theta \in \boldsymbol{\Theta}$ not depending on $n$. To fix ideas, one may think of i.i.d. random variables $Y_i$.


Definition 3.23 (Consistency) The sequence $A_n = A_n(\boldsymbol{Y}_n) \in \mathbb{R}^r$, $n \ge 1$, is
consistent for $\gamma : \boldsymbol{\Theta} \to \mathbb{R}^r$ if for all $\theta \in \boldsymbol{\Theta}$ and for all $\varepsilon > 0$ we have

$$\lim_{n \to \infty} P_\theta\left[\left\|A_n(\boldsymbol{Y}_n) - \gamma(\theta)\right\|_2 > \varepsilon\right] = 0.$$


Definition 3.23 says that $A_n(\boldsymbol{Y}_n)$ converges in probability to $\gamma(\theta)$ as $n \to \infty$. If
we (even) have a.s. convergence, we call $A_n$, $n \ge 1$, strongly consistent for $\gamma : \boldsymbol{\Theta} \to \mathbb{R}^r$.
Consistency is a minimal property that decision rules should fulfill. Typically, in
applications, this is not enough, and we are interested in (fast) rates of convergence,
i.e., we would like to know the error rates between $A_n(\boldsymbol{Y}_n)$ and $\gamma(\theta)$ for $n \to \infty$.


Example 3.24 (Consistency of the MLE in the EF) We revisit Corollary 3.19 and
consider an i.i.d. sequence of random variables $Y_i$, $i \ge 1$, belonging to an EF, and
we assume to work under a minimal representation and to have a steep cumulant
function $\kappa$. The MLE for $\boldsymbol{\mu}$ is given by the statistics

$$\hat{\boldsymbol{\mu}}_n^{\rm MLE} = \frac{1}{n}S(\boldsymbol{Y}_n) = \frac{1}{n}\sum_{i=1}^n T(Y_i) \in \overline{\mathcal{M}}.$$

We add a lower index $n$ to the MLE to indicate the sample size. The i.i.d. property
of $Y_i$, $i \ge 1$, implies that we can apply the strong law of large numbers which tells
us that we have $\lim_{n \to \infty} \hat{\boldsymbol{\mu}}_n^{\rm MLE} = \mathbb{E}_{\boldsymbol{\theta}}[T(Y_1)] = \nabla_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta}) = \boldsymbol{\mu}$, a.s., for all $\boldsymbol{\theta} \in \boldsymbol{\Theta}$.
This implies strong consistency of the sequence of MLEs $\hat{\boldsymbol{\mu}}_n^{\rm MLE}$, $n \ge 1$, for $\boldsymbol{\mu}$.

We have seen that these MLEs are also UMVU for $\boldsymbol{\mu}$, but if we transform them
to the canonical scale $\hat{\boldsymbol{\theta}}_n^{\rm MLE}$ they are, in general, biased for $\boldsymbol{\theta}$, see (3.22). However,
since the cumulant function $\kappa$ is strictly convex (under a minimal representation)
we receive $\lim_{n \to \infty} \hat{\boldsymbol{\theta}}_n^{\rm MLE} = \boldsymbol{\theta}$, a.s., which provides strong consistency also on the
canonical scale. ■

Proposition 3.25 Assume the real-valued random variables $Y_i$, $i \ge 1$, are
i.i.d. $F(\cdot;\theta)$ distributed with fixed $\theta \in \boldsymbol{\Theta}$. The resulting empirical distributions
$\hat{F}_n$, $n \ge 1$, are given by (3.9). Assume $Q$ is a Fisher-consistent functional for $\gamma(\theta)$,
i.e., $Q(F(\cdot;\theta)) = \gamma(\theta)$ for all $\theta \in \boldsymbol{\Theta}$. Moreover, assume that $Q$ is continuous in
$F(\cdot;\theta)$, for all $\theta \in \boldsymbol{\Theta}$, w.r.t. the supremum norm. The functionals $Q(\hat{F}_n)$, $n \ge 1$, are
consistent for $\gamma(\theta)$.


Sketch of Proof The Glivenko–Cantelli theorem [64, 159] says that the empirical
distribution $\hat{F}_n$ converges uniformly to $F(\cdot;\theta)$, a.s., for $n \to \infty$. Using the
assumptions made, we are allowed to exchange the corresponding limits, which
provides consistency. □
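
The uniform convergence used in the sketch of proof can be observed in a small simulation. The following sketch is an illustration only: it assumes a uniform distribution on $[0,1]$ (so that $F(y) = y$), a fixed seed, and a hypothetical helper `sup_distance_uniform` that evaluates $\sup_y |\hat{F}_n(y) - F(y)|$ at the jump points of the empirical distribution, which is where the supremum is attained.

```python
import random

def sup_distance_uniform(sample):
    """sup_y |F_n(y) - y| for a sample from the uniform distribution on [0, 1]."""
    ordered = sorted(sample)
    n = len(ordered)
    # At the i-th order statistic u the step function jumps from i/n to (i+1)/n.
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(ordered))

rng = random.Random(2024)                        # fixed seed for reproducibility
sample = [rng.random() for _ in range(10_000)]
distances = {n: sup_distance_uniform(sample[:n]) for n in (100, 1_000, 10_000)}
```

The supremum distance for the largest sample size is small, which is what continuity of $Q$ w.r.t. the supremum norm exploits in Proposition 3.25.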


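In view of Proposition 3.25, we discuss the case of the MLE of $\theta \in \boldsymbol{\Theta}$. In
Example 3.10 we have seen that the MLE of $\theta \in \boldsymbol{\Theta}$ is obtained from a Fisher-consistent
functional $Q$ for $\theta$ on the set of probability distributions $\mathfrak{P}$ given by

$$Q(F) = \underset{\tilde\theta}{\arg\max} \int \log f(y;\tilde\theta)\, dF(y) = \underset{\tilde\theta}{\arg\min}\ D_{\rm KL}\left(f \,\middle\|\, f(\cdot;\tilde\theta)\right),$$

in the second step we assumed that $F$ has a density $f$ w.r.t. a $\sigma$-finite measure $\nu$ on
$\mathbb{R}$.

Assume we have i.i.d. data $Y_i \sim f(\cdot;\theta)$, $i \ge 1$. Thus, the true data generating
distribution is described by the parameter $\theta \in \boldsymbol{\Theta}$. MLE requires the study of the
log-likelihood function (we scale with the sample size $n$)

$$\tilde\theta \mapsto \frac{1}{n}\ell_{\boldsymbol{Y}_n}(\tilde\theta) = \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\tilde\theta).$$

The law of large numbers gives us, a.s.,

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\tilde\theta) = \mathbb{E}_\theta\left[\log f(Y;\tilde\theta)\right]. \tag{3.26}$$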


Thus, if we are allowed to exchange the arg max operation and the limit in $n \to \infty$
we receive, a.s.,

$$\lim_{n \to \infty} \hat\theta_n^{\rm MLE} = \lim_{n \to \infty}\left(\underset{\tilde\theta}{\arg\max}\ \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\tilde\theta)\right) \overset{?}{=} \underset{\tilde\theta}{\arg\max}\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\tilde\theta)\right)$$

$$= \underset{\tilde\theta}{\arg\max}\ \mathbb{E}_\theta\left[\log f(Y;\tilde\theta)\right] = Q(F(\cdot;\theta)) = \theta. \tag{3.27}$$

That is, we receive consistency of the MLE for $\theta$ if we are allowed to exchange the
arg max operation and the limit in $n \to \infty$. This requires regularity conditions on
the considered family of distributions $\mathcal{F} = \{F(\cdot;\theta);\ \theta \in \boldsymbol{\Theta}\}$. The case of a finite
parameter space $\boldsymbol{\Theta} = \{\theta_1, \ldots, \theta_J\}$ is easy, this is a simplified version of Wald's
[374] consistency proof,

$$P_{\theta_j}\left[\theta_j \notin \underset{\theta_k}{\arg\max}\ \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\theta_k)\right] \le \sum_{k \ne j} P_{\theta_j}\left[\frac{1}{n}\sum_{i=1}^n \log f(Y_i;\theta_k) > \frac{1}{n}\sum_{i=1}^n \log f(Y_i;\theta_j)\right].$$

The right-hand side converges to 0 as $n \to \infty$ for all $\theta_k \ne \theta_j$, which gives
consistency. For regularity conditions on more general parameter spaces we refer
to Section 5.2 in Van der Vaart [363]. Basically, one needs that the arg max of the
limiting function given on the right-hand side of (3.26) is well-separated from other
large values of that function, see Theorem 5.7 in Van der Vaart [363].
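
For a finite parameter space the probability on the left-hand side of the above bound can be computed exactly in simple models. The sketch below assumes i.i.d. Bernoulli observations, the three-point parameter space $\{0.2, 0.5, 0.8\}$ for the success probability, and true parameter $0.5$; these choices and the function names are illustrative assumptions. Since the log-likelihood only depends on the number of successes $k$, the probability that the true parameter is not an arg max is a finite sum of binomial probabilities.

```python
import math

def log_lik(k, n, theta):
    """Bernoulli log-likelihood for k successes in n trials."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def prob_mle_misses(n, thetas, true_index):
    """Exact P[theta_j is not an arg max of the log-likelihood] under theta_j."""
    theta_true = thetas[true_index]
    miss = 0.0
    for k in range(n + 1):
        values = [log_lik(k, n, t) for t in thetas]
        if values[true_index] < max(values):
            # Binomial probability of k successes under the true parameter (log scale).
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                       + log_lik(k, n, theta_true))
            miss += math.exp(log_pmf)
    return miss

thetas = [0.2, 0.5, 0.8]
miss_probabilities = {n: prob_mle_misses(n, thetas, true_index=1) for n in (10, 100, 400)}
```

The miss probability decreases quickly in $n$ for this example, in line with the consistency argument for finite parameter spaces.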


Remarks 3.26


• The estimator from the arg max operation in (3.27) is also called M-estimator,
and $(y,a) \mapsto \log(f(y;a))$ plays the role of a scoring function (similar to
a loss function). The last line of (3.27) says that this scoring function is
strictly consistent for the functional $Q : \mathcal{F} \to \boldsymbol{\Theta}$, and Fisher-consistency of
this functional $Q$ implies

$$\mathbb{E}_\theta\left[\log f(Y;\tilde\theta)\right] \le \mathbb{E}_\theta\left[\log f\left(Y;Q(F(\cdot;\theta))\right)\right] = \mathbb{E}_\theta\left[\log f(Y;\theta)\right],$$

for all $\tilde\theta \in \boldsymbol{\Theta}$. Strict consistency of loss and scoring functions is going to be
defined formally in Sect. 4.1.3, below, and we have just seen that this plays an
important role for the consistency of M-estimators in the sense of Definition 3.23.

• Consistency (3.27) assumes that the data generating model $Y \sim F$ belongs to
the specified family $\mathcal{F} = \{F(\cdot;\theta);\ \theta \in \boldsymbol{\Theta}\}$. Model uncertainty may imply that
the data generating model does not belong to $\mathcal{F}$. In this situation, and if we are
allowed to exchange the arg max operation and the limit in $n$ in (3.27), the MLE
will provide the model in $\mathcal{F}$ that is closest in KL divergence to the true model $F$.
We come back to this in Sect. 11.1.4, below.


3.4.2 Asymptotic Normality 


As mentioned above, typically, we would like to have stronger results than just 
consistency. We give an introductory example based on the EF. 


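Example 3.27 (Asymptotic Normality of the MLE in the EF) We work under the
same EF as in Example 3.24. This example has provided consistency of the sequence
of MLEs $\hat{\boldsymbol{\mu}}_n^{\rm MLE}$, $n \ge 1$, for $\boldsymbol{\mu}$. Note that the i.i.d. property together with the finite
variance property immediately implies the following convergence in distribution

$$\sqrt{n}\left(\hat{\boldsymbol{\mu}}_n^{\rm MLE} - \boldsymbol{\mu}\right) \Rightarrow \mathcal{N}\left(\boldsymbol{0}, \nabla^2_{\boldsymbol{\theta}}\kappa(\boldsymbol{\theta})\right) \overset{\rm (d)}{=} \mathcal{N}\left(\boldsymbol{0}, \mathcal{I}_1(\boldsymbol{\theta})\right) \qquad \text{as } n \to \infty,$$

where $\boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\mu}) = (\nabla_{\boldsymbol{\theta}}\kappa)^{-1}(\boldsymbol{\mu}) \in \boldsymbol{\Theta}$ for $\boldsymbol{\mu} \in \mathcal{M}$, and $\mathcal{N}$ denotes the Gaussian
distribution. This is the multivariate version of the central limit theorem (CLT), and
it tells us that the rate of convergence is $1/\sqrt{n}$. This asymptotic result is stated in
terms of Fisher's information matrix under parametrization $\boldsymbol{\theta}$. We transform this
to the dual mean parametrization and call Fisher's information matrix under the
dual mean parametrization $\mathcal{I}_1^*(\boldsymbol{\mu})$. This involves the change of variable
$\boldsymbol{\mu} \mapsto \boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\mu}) = (\nabla_{\boldsymbol{\theta}}\kappa)^{-1}(\boldsymbol{\mu})$. The Jacobian matrix of this change of variable is given by
$J(\boldsymbol{\mu}) = \mathcal{I}_1(\boldsymbol{\theta}(\boldsymbol{\mu}))^{-1}$ and, thus, the transformation of Fisher's information matrix
gives, see also (3.16),

$$\boldsymbol{\mu} \mapsto \mathcal{I}_1^*(\boldsymbol{\mu}) = J(\boldsymbol{\mu})^\top\, \mathcal{I}_1(\boldsymbol{\theta}(\boldsymbol{\mu}))\, J(\boldsymbol{\mu}) = \mathcal{I}_1(\boldsymbol{\theta}(\boldsymbol{\mu}))^{-1}.$$

This allows us to express the above CLT w.r.t. Fisher's information matrix
corresponding to $\boldsymbol{\mu}$ and it gives us

$$\sqrt{n}\left(\hat{\boldsymbol{\mu}}_n^{\rm MLE} - \boldsymbol{\mu}\right) \Rightarrow \mathcal{N}\left(\boldsymbol{0}, \mathcal{I}_1^*(\boldsymbol{\mu})^{-1}\right) \qquad \text{as } n \to \infty. \tag{3.28}$$

We conclude that the appropriately normalized MLE $\hat{\boldsymbol{\mu}}_n^{\rm MLE}$ converges in distribution
to the centered Gaussian distribution having as covariance matrix the inverse
of Fisher's information matrix $\mathcal{I}_1^*(\boldsymbol{\mu})$, and the rate of convergence is $1/\sqrt{n}$.

Assume that the effective domain $\boldsymbol{\Theta}$ is open, and that $\boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\mu}) \in \boldsymbol{\Theta}$. This
allows us to transform asymptotic normality (3.28) to the canonical scale. Consider
again the change of variable $\boldsymbol{\mu} \mapsto \boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\mu}) = (\nabla_{\boldsymbol{\theta}}\kappa)^{-1}(\boldsymbol{\mu})$ with Jacobian matrix
$J(\boldsymbol{\mu}) = \mathcal{I}_1(\boldsymbol{\theta}(\boldsymbol{\mu}))^{-1} = \mathcal{I}_1^*(\boldsymbol{\mu})$. Theorem 1.9 in Section 5.2 of Lehmann [244] tells
us how the CLT transforms under such a change of variable, namely,

$$\sqrt{n}\left(\hat{\boldsymbol{\theta}}_n^{\rm MLE} - \boldsymbol{\theta}\right) = \sqrt{n}\left((\nabla_{\boldsymbol{\theta}}\kappa)^{-1}\left(\hat{\boldsymbol{\mu}}_n^{\rm MLE}\right) - (\nabla_{\boldsymbol{\theta}}\kappa)^{-1}(\boldsymbol{\mu})\right)$$

$$\Rightarrow \mathcal{N}\left(\boldsymbol{0}, J(\boldsymbol{\mu})\, \mathcal{I}_1^*(\boldsymbol{\mu})^{-1} J(\boldsymbol{\mu})^\top\right) \overset{\rm (d)}{=} \mathcal{N}\left(\boldsymbol{0}, \mathcal{I}_1(\boldsymbol{\theta})^{-1}\right) \qquad \text{as } n \to \infty. \tag{3.29}$$

We have exactly the same structural form in the two asymptotic results (3.28)
and (3.29). There is a main difference, $\hat{\boldsymbol{\mu}}_n^{\rm MLE}$ is unbiased for $\boldsymbol{\mu}$ whereas, in general,
$\hat{\boldsymbol{\theta}}_n^{\rm MLE}$ is not unbiased for $\boldsymbol{\theta}$, but we receive the same asymptotic behavior. ■

There are many different versions of asymptotic normality results similar
to (3.28) and (3.29), and the main difficulty often is to verify the assumptions made.
For instance, one can prove asymptotic normality based on a Fisher-consistent
functional $Q$. The assumptions made are, among others, that $Q$ needs to be Fréchet
differentiable in $P(\cdot;\theta)$ which, unfortunately, is rather difficult to verify. We make
a list of assumptions here that are easier to check and then we give a version of the
asymptotic normality result which is stated in the book of Lehmann [244]. This list
of assumptions in the one-dimensional case $\boldsymbol{\Theta} \subset \mathbb{R}$ reads as follows: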


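(i) $\Theta \subset \mathbb{R}$ is an open interval (possibly infinite).

(ii) The real-valued random variables $Y_i \sim F(\cdot; \theta)$, $i \ge 1$, have common support
$\mathfrak{T} = \{y \in \mathbb{R};\, f(y; \theta) > 0\}$ which is independent of $\theta \in \Theta$.

(iii) For every $y \in \mathfrak{T}$, the density $f(y; \theta)$ is three times continuously differentiable
in $\theta$.

(iv) The integral $\int f(y; \theta)\, d\nu(y)$ is twice differentiable under the integral sign.

(v) Fisher's information satisfies $\mathcal{I}_1(\theta) = \mathbb{E}_\theta\big[(\partial \log f(Y_1; \theta)/\partial\theta)^2\big] \in (0, \infty)$.

(vi) For every $\theta_0 \in \Theta$ there exist a positive constant $c$ and a function $M(y)$ (both
may depend on $\theta_0$) such that $\mathbb{E}_{\theta_0}[M(Y_1)] < \infty$ and

$$\left|\frac{\partial^3}{\partial\theta^3} \log f(y; \theta)\right| \le M(y) \qquad \text{for all } y \in \mathfrak{T} \text{ and } \theta \in (\theta_0 - c, \theta_0 + c).$$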


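Theorem 3.28 (Theorem 2.3 in Section 6.2 of Lehmann [244]) Assume $Y_i$,
$i \ge 1$, are i.i.d. $F(\cdot; \theta)$ distributed satisfying (i)–(vi) from above. Assume that
$\hat\theta_n = \hat\theta_n(\boldsymbol{Y}_n)$, $n \ge 1$, is a sequence of roots that solves the score equations

$$\frac{\partial}{\partial\tilde\theta} \sum_{i=1}^n \log f(Y_i; \tilde\theta) = \ell'_{\boldsymbol{Y}_n}(\tilde\theta) = 0,$$

and which is consistent for $\theta$, i.e. this sequence of roots $\hat\theta_n(\boldsymbol{Y}_n)$ converges in
probability to the true parameter $\theta$. Then we have asymptotic normality

$$\sqrt{n}\,\big(\hat\theta_n - \theta\big) \;\Longrightarrow\; \mathcal{N}\big(0, \mathcal{I}_1(\theta)^{-1}\big) \qquad \text{as } n \to \infty. \tag{3.30}$$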


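Sketch of Proof Fix $\theta \in \Theta$ and consider a Taylor expansion of the score $\ell'_{\boldsymbol{Y}_n}(\cdot)$ in
$\theta$ for $\hat\theta_n$. It is given by

$$\ell'_{\boldsymbol{Y}_n}(\hat\theta_n) = \ell'_{\boldsymbol{Y}_n}(\theta) + \ell''_{\boldsymbol{Y}_n}(\theta)\,(\hat\theta_n - \theta) + \frac{1}{2}\,\ell'''_{\boldsymbol{Y}_n}(\theta_n)\,(\hat\theta_n - \theta)^2,$$

for $\theta_n \in [\theta, \hat\theta_n]$. Since $\hat\theta_n$ is a root of the score, the left-hand side is equal to zero.
This allows us to re-arrange the above Taylor expansion as follows

$$\sqrt{n}\,(\hat\theta_n - \theta) = \frac{\frac{1}{\sqrt{n}}\,\ell'_{\boldsymbol{Y}_n}(\theta)}{-\frac{1}{n}\,\ell''_{\boldsymbol{Y}_n}(\theta) - \frac{1}{2n}\,\ell'''_{\boldsymbol{Y}_n}(\theta_n)\,(\hat\theta_n - \theta)}.$$

The numerator on the right-hand side converges in distribution to $\mathcal{N}(0, \mathcal{I}_1(\theta))$,
see (18) in Section 6.2 of [244], the first term in the denominator converges in
probability to $\mathcal{I}_1(\theta)$, see (19) in Section 6.2 of [244], and in the second term of
the denominator we have $\frac{1}{2n}\,\ell'''_{\boldsymbol{Y}_n}(\theta_n)$ which is bounded in probability, see (20) in
Section 6.2 of [244]. The claim then follows from Slutsky's theorem.
□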


Remarks 3.29

• A sequence $(\hat\theta_n)_{n \ge 1}$ satisfying Theorem 3.28 is called efficient likelihood
estimator (ELE) of $\theta$. Typically, the sequence of MLEs $\hat\theta_n^{\rm MLE}$ gives such an
ELE sequence, but there are counterexamples where this is not the case, see
Example 3.1 in Section 6.2 of Lehmann [244]. In that example $\hat\theta_n^{\rm MLE}$ exists for
all $n \ge 1$, but it converges in probability to $\infty$, regardless of the value of the true
parameter $\theta$.

• Any sequence of estimators that fulfills (3.30) is called asymptotically efficient,
because, similarly to the Cramér–Rao information bound of Theorem 3.13, it
attains $\mathcal{I}_1(\theta)^{-1}$ (which under certain assumptions is a lower variance bound
except on Lebesgue measure zero, see Theorem 1.1 in Section 6.1 of Lehmann
[244]). However, there are two important differences here: (1) the Cramér–Rao
information bound statement needs unbiasedness of the decision rule,
whereas (3.30) only requires consistency (but neither unbiasedness nor
asymptotically vanishing bias); and (2) the lower bound in the Cramér–Rao statement is
an effective variance (on a finite sample), whereas the quantity in (3.30) is only
an asymptotic variance. Moreover, any other sequence that differs in probability
from an asymptotically efficient one by less than $o(1/\sqrt{n})$ is asymptotically
efficient, too.

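• If we consider a differentiable function $\theta \mapsto \gamma(\theta)$, then Theorem 3.28 implies

$$\sqrt{n}\,\big(\gamma(\hat\theta_n) - \gamma(\theta)\big) \;\Longrightarrow\; \mathcal{N}\left(0, \frac{(\gamma'(\theta))^2}{\mathcal{I}_1(\theta)}\right) \qquad \text{as } n \to \infty. \tag{3.31}$$

This follows from asymptotic normality, consistency and considering a Taylor
expansion around $\theta$.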
• We started from the MLE problem

$$\hat\theta_n^{\rm MLE} = \underset{\tilde\theta \in \Theta}{\arg\max}\; \frac{1}{n} \sum_{i=1}^n \log f(Y_i; \tilde\theta). \tag{3.32}$$

In statistical theory a parameter estimator that is obtained through a maximization
operation is called M-estimator (for maximizing or minimizing), see also
Remarks 3.26. If the log-likelihood is differentiable in $\tilde\theta$ we can turn the above
problem into a root search problem for $\tilde\theta$

$$\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\tilde\theta} \log f(Y_i; \tilde\theta) = 0. \tag{3.33}$$

If a parameter estimator is obtained through a root search problem it is called
Z-estimator (for equating to zero). The Z-estimator (3.33) does not require a
maximum of the original function, but only a critical point; this is exactly what
we have been exploring in Theorem 3.28. More generally, for a sufficiently nice
function $\psi(\cdot; \theta)$ a Z-estimator $\hat\theta_n^{Z}$ for $\theta$ is obtained by solving the following
equation for $\tilde\theta$

$$\frac{1}{n} \sum_{i=1}^n \psi(Y_i; \tilde\theta) = 0, \tag{3.34}$$

for i.i.d. data $Y_i \sim F(\cdot; \theta)$. Suppose that the first moment of $\psi(Y_i; \tilde\theta)$ exists. The
law of large numbers gives us, a.s., see also (3.26),

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \psi(Y_i; \tilde\theta) = \mathbb{E}_\theta\big[\psi(Y_1; \tilde\theta)\big]. \tag{3.35}$$

Consistency of the Z-estimator $\hat\theta_n^{Z}$, $n \ge 1$, for $\theta$ is related to the right-hand
side of (3.35) being zero for $\tilde\theta = \theta$. Under additional regularity conditions (and
consistency) asymptotic normality then holds,

$$\sqrt{n}\,\big(\hat\theta_n^{Z} - \theta\big) \;\Longrightarrow\; \mathcal{N}\left(0, \frac{\mathbb{E}_\theta\big[\psi(Y_1; \theta)^2\big]}{\mathbb{E}_\theta\big[\frac{\partial}{\partial\theta}\psi(Y_1; \theta)\big]^2}\right) \qquad \text{as } n \to \infty. \tag{3.36}$$

For rigorous statements we refer to Theorems 5.21 and 5.41 in Van der Vaart
[363]. A modification to the regression case is given in Theorem 11.6 below.
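
The root search in (3.34) and a plug-in version of the asymptotic variance in (3.36) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `z_estimate` and `sandwich_variance`, the bisection bracket and the toy data are hypothetical choices, and the Poisson score in the canonical parameter is used as the function $\psi$.

```python
import math

def z_estimate(psi, ys, lo, hi, tol=1e-10):
    """Solve (1/n) * sum_i psi(y_i, t) = 0 for t by bisection on [lo, hi]."""
    def g(t):
        return sum(psi(y, t) for y in ys) / len(ys)

    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("no sign change on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid                  # root lies in [lo, mid]
        else:
            lo, g_lo = mid, g_mid     # root lies in [mid, hi]
    return 0.5 * (lo + hi)

def sandwich_variance(psi, dpsi, ys, t):
    """Empirical analogue of E[psi^2] / E[d psi / d theta]^2 evaluated at t."""
    n = len(ys)
    num = sum(psi(y, t) ** 2 for y in ys) / n
    den = (sum(dpsi(y, t) for y in ys) / n) ** 2
    return num / den

# Poisson score in the canonical parameter: d/dt (y*t - exp(t)) = y - exp(t)
def poisson_score(y, t):
    return y - math.exp(t)

def poisson_score_derivative(y, t):
    return -math.exp(t)

ys = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0]          # toy claim counts (illustrative)
theta_hat = z_estimate(poisson_score, ys, lo=-10.0, hi=10.0)
avar_hat = sandwich_variance(poisson_score, poisson_score_derivative, ys, theta_hat)
```

For this score the root is available in closed form, namely the logarithm of the sample mean, which gives a simple check of the bisection.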


Example 3.30 We consider the single-parameter linear EF for given strictly convex
and steep cumulant function $\kappa$ and w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$. The score
equation gives the requirement

$$\frac{1}{n} \sum_{i=1}^n Y_i = \kappa'(\tilde\theta) = \mathbb{E}_{\tilde\theta}[Y_1]. \tag{3.37}$$

Strict convexity implies that the right-hand side strictly increases in $\tilde\theta$. Therefore,
we have at most one solution of the score equation here. We assume that the
effective domain $\Theta \subset \mathbb{R}$ is open. It is easily verified that assumptions (ii)–(vi)
hold, in particular, (vi) saying that the third derivative should have a uniformly
bounded integrable bound holds because the third derivative is independent of $y$ and
continuous in $\theta$. With probability converging to 1, (3.37) has a solution $\hat\theta_n$ which
is unique, consistent and Theorem 3.28 holds. Note that in Example 3.5 we have
mentioned the Poisson case which can be degenerate. For the asymptotic normality
result we use here that this degeneracy asymptotically vanishes with probability
converging to one. ■
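
A small simulation may help to make the asymptotic statement concrete. The sketch below is an illustration under assumptions that are not part of the example itself: it picks the exponential distribution as the EF member, with canonical parameter $\theta < 0$, cumulant function $\kappa(\theta) = -\log(-\theta)$ and hence $\mathcal{I}_1(\theta) = \kappa''(\theta) = 1/\theta^2$; the sample size, number of repetitions and seed are arbitrary.

```python
import math
import random
import statistics

def simulate_scaled_errors(theta, n, reps, seed=1):
    """Simulate sqrt(n) * (theta_hat - theta) for i.i.d. exponential data."""
    rng = random.Random(seed)
    rate = -theta                        # exponential rate; canonical theta < 0
    out = []
    for _ in range(reps):
        y_bar = sum(rng.expovariate(rate) for _ in range(n)) / n
        theta_hat = -1.0 / y_bar         # root of y_bar = kappa'(theta) = -1/theta
        out.append(math.sqrt(n) * (theta_hat - theta))
    return out

theta = -2.0
errors = simulate_scaled_errors(theta, n=200, reps=2000)
empirical_var = statistics.pvariance(errors)
asymptotic_var = theta ** 2              # 1 / I_1(theta) in this model
```

The empirical variance of the scaled errors should be close to $\theta^2$, and the small remaining mean reflects the finite-sample bias of $\hat\theta_n$ that vanishes asymptotically.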


Remark 3.31 (Multi-Dimensional Extension) For an extension of Theorem 3.28 to
the multi-dimensional case $\Theta \subset \mathbb{R}^k$ we refer to Section 6.4 in Lehmann [244]. The
assumptions made in the multi-dimensional case do not essentially differ from the
ones in the 1-dimensional case.

Chapter 4
Predictive Modeling and Forecast Evaluation


In the previous chapter, we have fully focused on parameter estimation $\theta \in \Theta$ and
the estimation of functions $\theta \mapsto \gamma(\theta)$ by exploiting decision rules $A$ for estimating
$\boldsymbol{Y}_n \mapsto \hat\theta = A(\boldsymbol{Y}_n)$ or $\boldsymbol{Y}_n \mapsto \widehat{\gamma(\theta)} = A(\boldsymbol{Y}_n)$, respectively. The derivations in
that chapter analyzed the quality of decision rules in terms of loss functions which
compare, e.g., the action $\hat\theta = A(\boldsymbol{Y}_n)$ to the true parameter $\theta$. The Cramér–Rao
information bound considers this in terms of a square loss function. In actuarial
modeling, parameter estimation is only part of the problem, and the second part is
to predict new random variables $Y$. These new random variables should be thought
of as claims in the future that we try to predict (and price) using decision rules being
developed based on past information $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$. In this case, we would
like to study how a decision rule $A(\boldsymbol{Y}_n)$ generalizes to new data $Y$, and we then
call the decision rule rather a predictor for $Y$. This capability of suitable decision
rules to generalize to new (unseen) data is analyzed in Sect. 4.1. Such an analysis
often relies on (numerical) techniques such as cross-validation, which is examined
in Sect. 4.2, or the bootstrap technique, being presented in Sect. 4.3, below. In this
chapter, we denote past observations by $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$ supported on $\mathcal{Y}$, and
the (real-valued) random variables to be predicted are denoted by $Y$ with support
$\mathfrak{Y} \subset \mathbb{R}$. Often we have $\mathcal{Y} = \mathfrak{Y} \times \cdots \times \mathfrak{Y}$.


4.1 Generalization Loss 


We start by considering the most commonly used expected generalization loss 
(GL) which is the mean squared error of prediction (MSEP). The MSEP is based 
on the square loss function, and it can be seen as a distribution-free approach to 
measure expected GL. In subsequent sections we will study distribution-adapted 
GL approaches. Expected GL measurement with MSEP is considered to be general 
knowledge and we do not give a specific reference in this section. Distribution-adapted
versions are mainly based on the strictly consistent scoring framework of
Gneiting—Raftery [163] and Gneiting [162]. In particular, we will discuss deviance 
losses in Sect. 4.1.2 that are strictly consistent scoring functions for mean estimation 
and, hence, provide proper scoring rules. 


4.1.1 Mean Squared Error of Prediction 


We denote by $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$ the (past) observations on which predictors and
decision rules $A : \mathcal{Y} \to \mathbb{A}$ are based. The new observation that we would like
to predict is denoted by $Y$ having support $\mathfrak{Y} \subset \mathbb{R}$. In the previous chapter we have
used the decision rule $A(\boldsymbol{Y}_n)$ to estimate an unknown quantity $\gamma(\theta)$. In this section
we will use this decision rule to directly predict the new (unseen) observation $Y$.


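Theorem 4.1 (Mean Squared Error of Prediction, MSEP) Assume that
$\boldsymbol{Y}_n$ and $Y$ are independent. Assume that the predictor $A : \mathcal{Y} \to \mathbb{A} \subseteq \mathbb{R}$,
$\boldsymbol{Y}_n \mapsto A(\boldsymbol{Y}_n)$ has finite second moment, and that the real-valued random
variable $Y$ has finite second moment, too. The MSEP of predictor $A$ to predict
$Y$ is given by

$$\mathbb{E}\big[(Y - A(\boldsymbol{Y}_n))^2\big] = \big(\mathbb{E}[Y] - \mathbb{E}[A(\boldsymbol{Y}_n)]\big)^2 + \mathrm{Var}(A(\boldsymbol{Y}_n)) + \mathrm{Var}(Y). \tag{4.1}$$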


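Proof of Theorem 4.1 We compute

$$\mathbb{E}\big[(A(\boldsymbol{Y}_n) - Y)^2\big] = \mathbb{E}\big[(A(\boldsymbol{Y}_n) - \mathbb{E}[Y] + \mathbb{E}[Y] - Y)^2\big]$$
$$= \mathbb{E}\big[(A(\boldsymbol{Y}_n) - \mathbb{E}[Y])^2\big] + \mathbb{E}\big[(\mathbb{E}[Y] - Y)^2\big] + 2\,\mathbb{E}\big[(A(\boldsymbol{Y}_n) - \mathbb{E}[Y])(\mathbb{E}[Y] - Y)\big]$$
$$= \mathbb{E}\big[(\mathbb{E}[Y] - \mathbb{E}[A(\boldsymbol{Y}_n)] + \mathbb{E}[A(\boldsymbol{Y}_n)] - A(\boldsymbol{Y}_n))^2\big] + \mathrm{Var}(Y)$$
$$= \big(\mathbb{E}[Y] - \mathbb{E}[A(\boldsymbol{Y}_n)]\big)^2 + \mathrm{Var}(A(\boldsymbol{Y}_n)) + \mathrm{Var}(Y),$$

where on the second last line we use the independence between $\boldsymbol{Y}_n$ and $Y$, which makes
the cross term vanish. This finishes the proof. □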
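
The decomposition (4.1) can be checked exactly on a small discrete model. The following sketch is an illustration only: the Bernoulli model, the sample size and the deliberately biased predictor `predictor` (a shrunken sample mean) are hypothetical choices that do not appear in the theorem.

```python
from itertools import product

p = 0.3          # Bernoulli success probability (illustrative)
n = 3            # number of past observations
shrink = 0.8     # shrinkage factor, makes the predictor deliberately biased

def predictor(ys):
    """Hypothetical decision rule A(Y_n): shrunken sample mean."""
    return shrink * sum(ys) / len(ys)

def prob(ys):
    """Probability of a tuple of independent Bernoulli(p) outcomes."""
    out = 1.0
    for y in ys:
        out *= p if y == 1 else 1.0 - p
    return out

samples = list(product([0, 1], repeat=n))
mean_A = sum(prob(s) * predictor(s) for s in samples)
var_A = sum(prob(s) * (predictor(s) - mean_A) ** 2 for s in samples)
mean_Y, var_Y = p, p * (1.0 - p)

# Left-hand side of (4.1): enumerate (Y_n, Y) jointly, using independence
msep = sum(prob(s) * prob((y,)) * (y - predictor(s)) ** 2
           for s in samples for y in (0, 1))
# Right-hand side of (4.1): squared bias + estimation variance + process variance
decomposition = (mean_Y - mean_A) ** 2 + var_A + var_Y
```

Both quantities agree up to floating-point error, and the three summands show how much of the MSEP is due to the bias introduced by the shrinkage.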


Remarks 4.2 (Expected Generalization Loss)

• The quantity $\mathbb{E}[(Y - A(\boldsymbol{Y}_n))^2]$ is an expected GL because it measures how well
the decision rule (predictor) $A(\boldsymbol{Y}_n)$ generalizes to new (unseen) data $Y$. As loss
function we use the square loss function

$$L : \mathfrak{Y} \times \mathbb{A} \to \mathbb{R}_+, \qquad (y, a) \mapsto L(y, a) = (y - a)^2. \tag{4.2}$$

Therefore, this expected GL is called MSEP.

• MSEP (4.1) is called expected GL. If we condition on $\boldsymbol{Y}_n$, then we call it GL. For
the square loss function the GL (conditional MSEP) is given by

$$\mathbb{E}\big[(Y - A(\boldsymbol{Y}_n))^2 \,\big|\, \boldsymbol{Y}_n\big] = \big(\mathbb{E}[Y] - A(\boldsymbol{Y}_n)\big)^2 + \mathrm{Var}(Y), \tag{4.3}$$

where we have used independence between $Y$ and $\boldsymbol{Y}_n$.

• We do not distinguish the terms 'prediction' and 'forecast'. Sometimes the
literature makes a subtle difference between the two, the latter involving a
temporal component and the former not. In the context of prediction/forecasting
a loss function (4.2) is also called scoring function. We also use these two terms
interchangeably in the context of prediction/forecasting.

• The MSEP in Theorem 4.1 decouples into three terms:

  – The first term $(\mathbb{E}[Y] - \mathbb{E}[A(\boldsymbol{Y}_n)])^2$ is the (squared) bias. Obviously, good
    decision rules $A(\boldsymbol{Y}_n)$ under the MSEP should be unbiased for $\mathbb{E}[Y]$. If we
    compare this to the previous chapter, we note that now the bias is measured
    w.r.t. the mean of the new observation $Y$. Additionally, there might be a slight
    difference to the previous chapter if $\boldsymbol{Y}_n$ and $Y$ do not belong to the same
    parameter $\theta \in \Theta$ (if we work in a parametrized family): the risk function
    in (3.3) considers $R(\theta, A) = \mathbb{E}_\theta[L(\theta, A(\boldsymbol{Y}_n))]$ with both components of the
    loss function $L$ belonging to the same parameter value $\theta$. For the MSEP we
    replace $\theta$ in $L(\theta, A(\boldsymbol{Y}_n))$ by the new observation $Y$ that might originate from
    a different distribution (or from a randomized $\theta$ in a Bayesian case).

  – The second term $\mathrm{Var}(A(\boldsymbol{Y}_n))$ is called estimation variance or statistical error.

  – The last term $\mathrm{Var}(Y)$ is called process variance or irreducible risk. It reflects
    the pure randomness received from the fact that we try to predict random
    variables $Y$ with deterministic means $\mathbb{E}[Y]$.

• All three terms on the right-hand side of (4.1) are non-negative. The MSEP
optimal predictor for $Y$ is its expected value $\mathbb{E}[Y]$. For this choice, the first two
terms (squared bias and estimation variance) vanish, and we are only left with
the irreducible risk. Since this MSEP optimal predictor is typically unknown it
is replaced by a decision rule $A(\boldsymbol{Y}_n)$ that is based on past experience $\boldsymbol{Y}_n$. This
decision rule is used to predict $Y$, but it can also be seen as an estimator for
$\mathbb{E}[Y]$. A good decision rule $A(\boldsymbol{Y}_n)$ is unbiased for $\mathbb{E}[Y]$, making the first term on
the right-hand side of (4.1) equal to zero, and at the same time trying to make
the estimation variance small. Typically, this cannot be achieved simultaneously
and, therefore, there is a trade-off between bias and estimation variance in most
applied statistical problems.

• We emphasize that in financial applications we typically aim for unbiased
estimators for $\mathbb{E}[Y]$; we especially refer to Sect. 7.4.2 that studies the balance
property in network regression models under a stationary portfolio assumption.
Here, this stationarity may, e.g., translate into a (stronger) i.i.d. assumption on
$Y_1, \ldots, Y_n, Y$. Unbiasedness then implies that the predictor $A(\boldsymbol{Y}_n)$ is optimal
in (4.1) if it meets the Cramér–Rao information bound, see Theorem 3.13.


Theorem 4.1 considers the MSEP which implicitly assumes that the square loss 
function is the objective (scoring) function of interest. The square loss function may 
be considered as being distribution-free, but it is motivated by a Gaussian model for 
$\boldsymbol{Y}_n$ and $Y$, respectively; this will be justified in Remarks 4.6, below. If we use the
square loss function for observations different from Gaussian ones it might under- 
or over-weigh particular characteristics in these observations because they may not 
look very Gaussian (e.g. more heavy-tailed). Therefore, we should always choose a 
scoring function that fits the problem considered, for instance, a square loss function 
is not appropriate if we model claim counts following a Poisson distribution. We 
close this section with the example of the EDF. 


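Example 4.3 (MSEP Within the EDF) We choose a fixed single-parameter linear
EDF satisfying Assumption 2.6 and having a steep cumulant function $\kappa$, see
Theorem 2.19 and Remark 2.20. Assume we have independent random variables
$Y_1, \ldots, Y_n, Y$ belonging to this EDF having densities, see Example 3.5,

$$Y_i \sim f(y_i; \theta, v_i/\varphi) = \exp\left\{\frac{y_i\theta - \kappa(\theta)}{\varphi/v_i} + a(y_i; v_i/\varphi)\right\}, \tag{4.4}$$

and similarly for $Y \sim f(y; \theta, v/\varphi)$. Note that all random variables share the same
canonical parameter $\theta \in \Theta$. The MLE of $\mu \in \mathcal{M}$ based on $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$ is
found by solving, see (3.4)–(3.5),

$$\hat\mu^{\rm MLE} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n) = \underset{\tilde\mu \in \mathcal{M}}{\arg\max}\; \ell_{\boldsymbol{Y}_n}(\tilde\mu) = \underset{\tilde\mu \in \mathcal{M}}{\arg\max}\; \sum_{i=1}^n \frac{Y_i h(\tilde\mu) - \kappa(h(\tilde\mu))}{\varphi/v_i}, \tag{4.5}$$

with canonical link $h = (\kappa')^{-1}$. Since the cumulant function $\kappa$ is strictly convex and
assumed to be steep, there exists a unique solution $\hat\mu^{\rm MLE} \in \overline{\mathcal{M}}$. If $\hat\mu^{\rm MLE} \in \mathcal{M}$ we
have a proper solution providing $\hat\theta^{\rm MLE} = h(\hat\mu^{\rm MLE}) \in \Theta$, otherwise $\hat\mu^{\rm MLE}$ provides
a degenerate model. This decision rule $\boldsymbol{Y}_n \mapsto \hat\mu^{\rm MLE} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$ is now used
to predict the (independent) new random variable $Y$ and to estimate the unknown
parameters $\theta$ and $\mu$, respectively. That is, we use the following predictor for $Y$

$$\boldsymbol{Y}_n \;\mapsto\; \hat{Y} = \hat{\mathbb{E}}_\theta[Y] = \mathbb{E}_{\hat\theta^{\rm MLE}}[Y] = \hat\mu^{\rm MLE} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n).$$

Note that this predictor $\hat{Y}$ is used to predict an unobserved (new) random variable
$Y$, and it is itself a random variable as a function of (independent) past observations
$\boldsymbol{Y}_n$. We calculate the MSEP in this model. Using Theorem 4.1 we obtain

$$\mathbb{E}_\theta\big[(Y - \hat\mu^{\rm MLE})^2\big] = \big(\mathbb{E}_\theta[Y] - \mathbb{E}_\theta[\hat\mu^{\rm MLE}]\big)^2 + \mathrm{Var}_\theta(\hat\mu^{\rm MLE}) + \mathrm{Var}_\theta(Y)$$
$$= \big(\kappa'(\theta) - \kappa'(\theta)\big)^2 + \frac{(\kappa''(\theta))^2}{\mathcal{I}(\theta)} + \frac{\varphi\,\kappa''(\theta)}{v} \tag{4.6}$$
$$= \frac{(\kappa''(\theta))^2}{\mathcal{I}(\theta)} + \frac{\varphi\,\kappa''(\theta)}{v},$$

see (3.25) for Fisher's information $\mathcal{I}(\theta)$. In this calculation we have used that the
MLE $\hat\mu^{\rm MLE}$ is UMVU for $\mu = \kappa'(\theta)$ and that $\boldsymbol{Y}_n$ and $Y$ come from the same
EDF with the same canonical parameter $\theta \in \Theta$. As a result, we are only left
with estimation variance and process variance; moreover, the estimation variance
asymptotically vanishes as $\sum_{i=1}^n v_i \to \infty$. ■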
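
The two remaining terms of (4.6) are easy to evaluate numerically. The sketch below assumes, as a labeled assumption not restated in this example, that Fisher's information of the sample is $\mathcal{I}(\theta) = \kappa''(\theta)\sum_{i=1}^n v_i/\varphi$; the helper name `edf_msep`, the Poisson choice and the exposures are hypothetical.

```python
import math

def edf_msep(kappa2, phi, v_past, v_new):
    """Return (estimation variance, process variance) as in (4.6).

    kappa2 is kappa''(theta), phi the dispersion parameter, v_past the
    exposures v_1..v_n of the past observations, v_new the exposure v of Y.
    """
    fisher = kappa2 * sum(v_past) / phi    # assumed form of I(theta)
    est_var = kappa2 ** 2 / fisher         # equals phi * kappa2 / sum(v_past)
    proc_var = phi * kappa2 / v_new
    return est_var, proc_var

theta = math.log(0.1)                      # Poisson: kappa(theta) = exp(theta)
kappa2 = math.exp(theta)                   # kappa''(theta) = mu = 0.1
small = edf_msep(kappa2, 1.0, [1.0] * 10, 1.0)
large = edf_msep(kappa2, 1.0, [1.0] * 1000, 1.0)
```

Increasing the total past exposure from 10 to 1000 shrinks the estimation variance by a factor of 100, while the process variance of the new observation is unchanged.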


4.1.2 Unit Deviances and Deviance Generalization Loss 


The main estimation technique used in these notes is MLE introduced in
Definition 3.4. At this stage, MLE is unrelated to any specific scoring function $L$
because it has been received by maximizing the log-likelihood function. In this
section we discuss the deviance loss function (as a scoring function) and we
highlight its connection to the Bregman divergence introduced in Sect. 2.3. Based
on the deviance loss function choice we rephrase Theorem 4.1 in terms of this
scoring function. A theoretical foundation to these considerations will be given in
Sect. 4.1.3, below.

For the derivations in this section we rely on the same single-parameter linear
EDF as in Example 4.3, having a steep cumulant function $\kappa$. The MLE of $\mu = \kappa'(\theta)$
is found by solving, see (4.5),

$$\hat\mu^{\rm MLE} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n) = \underset{\tilde\mu \in \mathcal{M}}{\arg\max}\; \sum_{i=1}^n \frac{Y_i h(\tilde\mu) - \kappa(h(\tilde\mu))}{\varphi/v_i},$$
eM j=1 p/dj 


with canonical link $h = (\kappa')^{-1}$. This decision rule $\boldsymbol{Y}_n \mapsto \hat\mu^{\rm MLE} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$
is now used to predict the (new) random variable $Y$ and to estimate the unknown
parameters $\theta$ and $\mu$, respectively. We aim at studying the expected GL under a
distribution-adapted loss function choice potentially different from the square loss
function. Below we will justify this second choice more extensively.

For the saturated model the common canonical parameter $\theta$ of the independent
random variables $Y_1, \ldots, Y_n$ in (4.4) is replaced by individual canonical parameters
$\theta_i$, $1 \le i \le n$. These individual canonical parameters are estimated with individual
MLEs. The individual MLEs are given by, respectively,

$$\hat\theta_i^{\rm MLE} = h(Y_i) \qquad \text{and} \qquad \hat\mu_i^{\rm MLE} = Y_i \in \overline{\mathcal{M}},$$

the latter always exists because of strict convexity and steepness of $\kappa$. Since the
MLE $\hat\mu_i^{\rm MLE} = Y_i$ maximizes the log-likelihood, we receive for any $\mu \in \mathcal{M}$ the
inequality

$$0 \le 2\,\Big(\log f(Y_i; h(Y_i), v_i/\varphi) - \log f(Y_i; h(\mu), v_i/\varphi)\Big)$$
$$= 2\,\frac{v_i}{\varphi}\,\Big(Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu) + \kappa(h(\mu))\Big) = \frac{v_i}{\varphi}\,\mathfrak{d}(Y_i, \mu). \tag{4.7}$$

The function $(y, \mu) \mapsto \mathfrak{d}(y, \mu) \ge 0$ is the unit deviance introduced in (2.25),
extended to $\mathfrak{C}$, and it is zero if and only if $y = \mu$, see Lemma 2.22. The latter
is also an immediate consequence of the fact that the MLE is unique within EDFs.


Remark 4.4 The unit deviance $\mathfrak{d}(y, \mu)$ has only been considered on $\mathring{\mathfrak{C}} \times \mathcal{M}$
in (2.25). Having steepness of cumulant function $\kappa$ implies $\mathring{\mathfrak{C}} = \mathcal{M}$, see
Theorem 2.19, and in the absolutely continuous EDF case, we always have $Y_i \in \mathcal{M}$, a.s.,
which makes (4.7) well-defined for all observations $Y_i$, a.s. In the discrete or the
mixed EDF case, an observation $Y_i$ can be at the boundary of $\mathcal{M}$. In that case (4.7)
must be calculated from

$$\mathfrak{d}(Y_i, \mu) = 2\,\Big(\sup_{\tilde\theta \in \Theta}\big[Y_i\tilde\theta - \kappa(\tilde\theta)\big] - Y_i h(\mu) + \kappa(h(\mu))\Big). \tag{4.8}$$

This applies, e.g., to the Poisson or Bernoulli cases for observation $Y_i = 0$; in these
cases we obtain unit deviances $2\mu$ and $-2\log(1 - \mu)$, respectively.
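
The boundary case can be handled in code by the usual convention $0 \cdot \log 0 = 0$, which corresponds to the supremum in (4.8) for the Poisson and Bernoulli models. The following is a minimal sketch; the helper names `xlogy`, `poisson_unit_deviance` and `bernoulli_unit_deviance` are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def poisson_unit_deviance(y, mu):
    # 2 * (mu - y - y*log(mu/y)), written so that y = 0 is allowed
    return 2.0 * (mu - y + xlogy(y, y) - xlogy(y, mu))

def bernoulli_unit_deviance(y, mu):
    # 2 * (-y*log(mu) - (1 - y)*log(1 - mu)) for y in {0, 1}
    return 2.0 * (-xlogy(y, mu) - xlogy(1 - y, 1 - mu))

d_pois_zero = poisson_unit_deviance(0, 0.3)     # boundary observation Y_i = 0
d_bern_zero = bernoulli_unit_deviance(0, 0.3)   # boundary observation Y_i = 0
```

For $Y_i = 0$ these functions return $2\mu$ and $-2\log(1-\mu)$, in line with the remark, and they return zero when the observation coincides with the mean.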


The previous considerations (4.7)–(4.8) have been studying one single
observation $Y_i$ of $\boldsymbol{Y}_n$. Aggregating over all observations in $\boldsymbol{Y}_n$ (and additionally using
independence between the individual components of $\boldsymbol{Y}_n$) we arrive at the so-called
deviance loss function

$$\mathfrak{D}(\boldsymbol{Y}_n, \mu) \;\overset{\rm def.}{=}\; \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\,\mathfrak{d}(Y_i, \mu) \tag{4.9}$$
$$= \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\,\Big(Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu) + \kappa(h(\mu))\Big) \;\ge\; 0.$$

The deviance loss function $\mathfrak{D}(\boldsymbol{Y}_n, \mu)$ subtracts twice the log-likelihood $\ell_{\boldsymbol{Y}_n}(\mu)$
from the one of the saturated model. Thus, it introduces a sign flip compared to (4.5).
This immediately gives us the following corollary.


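Corollary 4.5 (Deviance Loss Function) The MLE problem (4.5) is
equivalent to solving

$$\hat\mu^{\rm MLE} = \underset{\tilde\mu \in \mathcal{M}}{\arg\max}\; \ell_{\boldsymbol{Y}_n}(\tilde\mu) = \underset{\tilde\mu \in \mathcal{M}}{\arg\min}\; \mathfrak{D}(\boldsymbol{Y}_n, \tilde\mu). \tag{4.10}$$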
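
As a numerical illustration of Corollary 4.5, the deviance loss can be minimized over a grid and compared with the exposure-weighted mean $\sum_i v_i Y_i / \sum_i v_i$, which solves the score equation in this EDF setting. The sketch assumes a Poisson model with $\varphi = 1$; the toy frequencies, exposures, grid and function names are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def poisson_unit_deviance(y, mu):
    return 2.0 * (mu - y + xlogy(y, y) - xlogy(y, mu))

def deviance_loss(ys, vs, mu, phi=1.0):
    """Average exposure-weighted unit deviance, as in the deviance loss (4.9)."""
    n = len(ys)
    return sum(v / phi * poisson_unit_deviance(y, mu) for y, v in zip(ys, vs)) / n

ys = [0.0, 0.5, 0.0, 0.25]      # observed frequencies N_i / v_i (illustrative)
vs = [1.0, 2.0, 0.5, 4.0]       # exposures v_i (illustrative)

grid = [k / 10000 for k in range(1, 10001)]          # candidate means in (0, 1]
mu_grid = min(grid, key=lambda m: deviance_loss(ys, vs, m))
mu_closed = sum(v * y for y, v in zip(ys, vs)) / sum(vs)
```

The grid minimizer agrees with the weighted mean up to the grid resolution, so minimizing the deviance loss and maximizing the log-likelihood select the same estimate here.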
Remarks 4.6

• Formula (4.10) replaces a maximization problem by a minimization problem
with objective function $\mathfrak{D}(\boldsymbol{Y}_n, \mu)$ being bounded below by zero. We can use
this deviance loss function as a loss function not only for parameter estimation,
but also as a scoring function for analyzing GLs within the EDF (similarly to
Theorem 4.1).

• We draw the link to the KL divergence discussed in Sect. 2.3. In formula (2.26)
we have shown that the unit deviance is equal to the KL divergence (up to
scaling with factor 2), thus, equivalently, MLE aims at minimizing the average
KL divergence over all observations $\boldsymbol{Y}_n$

$$\hat\theta^{\rm MLE} = \underset{\tilde\theta \in \Theta}{\arg\min}\; \frac{1}{n} \sum_{i=1}^n D_{\rm KL}\Big(f(\cdot; h(Y_i), v_i/\varphi) \,\Big\|\, f(\cdot; \tilde\theta, v_i/\varphi)\Big),$$

by finding an optimal parameter $\hat\theta^{\rm MLE}$ somewhere 'in the middle' of the
observations $\hat\theta_1^{\rm MLE} = h(Y_1), \ldots, \hat\theta_n^{\rm MLE} = h(Y_n)$. This then provides us with,
see (2.27),

$$\prod_{i=1}^n f(Y_i; \theta, v_i/\varphi) = \prod_{i=1}^n f(Y_i; h(Y_i), v_i/\varphi)\, \exp\left\{-\frac{1}{2\varphi/v_i}\,\mathfrak{d}(Y_i, \mu)\right\} \tag{4.11}$$
$$\propto\; \exp\left\{-\sum_{i=1}^n D_{\rm KL}\Big(f(\cdot; h(Y_i), v_i/\varphi) \,\Big\|\, f(\cdot; \theta, v_i/\varphi)\Big)\right\},$$

where $\propto$ highlights that we drop all terms that do not involve $\theta$. This describes the
change in joint likelihood by varying the canonical parameter $\theta$ over its domain
$\Theta$. The first line of (4.11) is in the spirit of minimizing a weighted square loss, but
the Gaussian square is replaced by the unit deviance $\mathfrak{d}$. The second line of (4.11)
is in the spirit of information geometry considered in Sect. 2.3, where we try to
find a canonical parameter $\theta$ that has a small KL divergence to the $n$ individual
models being parametrized by $h(Y_1), \ldots, h(Y_n)$, thus, the MLE $\hat\theta^{\rm MLE}$ provides
an optimal balance over the entire set of (independent) observations $Y_1, \ldots, Y_n$
w.r.t. the KL divergence.

• In contrast to the square loss function, the deviance loss function $\mathfrak{D}(\boldsymbol{Y}_n, \mu)$
respects the distributional properties of $\boldsymbol{Y}_n$, see (4.11). That is, if the underlying
distribution allows for larger or smaller claims, this fact is appropriately valued
in the deviance loss function (supposed that we have chosen the right family of
distributions; model uncertainty will be studied in Sect. 11.1, below).

• Assume we work in the Gaussian model. In this model we have $\kappa(\theta) = \theta^2/2$
and canonical link $h(\mu) = \mu$, see Sect. 2.1.3. This provides unit deviance in the
Gaussian case $\mathfrak{d}(y, \mu) = (y - \mu)^2$, which is exactly the square loss function for
action space $\mathbb{A} = \mathcal{M}$. Thus, the square loss function is most appropriate in the
Gaussian case.

• As explained above, we use unit deviances $\mathfrak{d}(y, \mu)$ as a measure of discrepancy.
Alternatively, as in the introduction to this section, see (4.6), we can consider
Pearson's $\chi^2$-statistic which corresponds to the weighted square loss function

$$X^2(y, \mu) = \frac{(y - \mu)^2}{V(\mu)}, \tag{4.12}$$

where $\mu \mapsto V(\mu)$ is the variance function of the chosen EDF. Similarly to
the deviance loss function (4.9), we can aggregate these Pearson's $\chi^2$-statistics
$X^2(Y_i, \mu)$ over all observations $Y_i$ in $\boldsymbol{Y}_n$ to receive a second overall measure of
discrepancy. In the Gaussian case the deviance loss and Pearson's $\chi^2$-statistic
coincide and have a $\chi^2$-distribution; for other distributions asymptotic results are
available.

In the non-Gaussian case, (4.12) is not always robust. For instance, if we
work in the Poisson model, we have variance function $V(\mu) = \mu$. Our examples
below will have low claim frequencies which implies that $\mu$ will be small. The
appearance of a small $\mu$ in the denominator of (4.12) will imply that Pearson's
$\chi^2$-statistic is not very robust in small frequency applications, in particular, if we
need to estimate this $\mu$ from $\boldsymbol{Y}_n$. Therefore, we refrain from using (4.12).


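Naturally, in analogy to Theorem 4.1 and derivation (4.6), the above
considerations motivate us to consider expected GLs under unit deviances within the EDF.
We use the decision rule $\hat\mu^{\rm MLE}(\boldsymbol{Y}_n) \in \mathbb{A} = \mathcal{M}$ to predict a new observation $Y$.

The expected deviance GL is defined and given by

$$\mathbb{E}_\theta\Big[\mathfrak{d}\big(Y, \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)\big)\Big]$$
$$= \mathbb{E}_\theta\big[\mathfrak{d}(Y, \mu)\big] + 2\,\mathbb{E}_\theta\Big[Y h(\mu) - \kappa(h(\mu)) - Y h\big(\hat\mu^{\rm MLE}(\boldsymbol{Y}_n)\big) + \kappa\big(h(\hat\mu^{\rm MLE}(\boldsymbol{Y}_n))\big)\Big]$$
$$= \mathbb{E}_\theta\big[\mathfrak{d}(Y, \mu)\big] + \mathcal{E}\big(\mu, \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)\big), \tag{4.13}$$

the last identity uses independence between $\boldsymbol{Y}_n$ and $Y$, and with estimation
risk function

$$\mathcal{E}\big(\mu, \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)\big) = \mathbb{E}_\theta\Big[\mathfrak{d}\big(\mu, \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)\big)\Big] > 0, \tag{4.14}$$

we use steepness of the cumulant function, $\mathfrak{C} = \overline{\rm conv}(\mathfrak{T}) = \overline{\mathcal{M}}$, and Lemma 2.22
for the strict positivity of the estimation risk function. Thus, for the estimation risk
function $\mathcal{E}$ we replace $Y$ by $\mu$ in the unit deviance and the expectation $\mathbb{E}_\theta$ is only
over the observations $\boldsymbol{Y}_n$. This looks like a very convincing generalization of the
MSEP, however, one needs to ensure that all terms in (4.13) exist.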


Theorem 4.7 (Expected Deviance Generalization Loss) Assume that $\boldsymbol{Y}_n$
and $Y$ are independent and belong to the same linear EDF having the same
canonical parameter $\theta \in \Theta$ and having strictly convex and steep cumulant
function $\kappa$. Choose a predictor $A : \mathcal{Y} \to \mathbb{A} = \mathcal{M}$, $\boldsymbol{Y}_n \mapsto A(\boldsymbol{Y}_n)$ and assume
that all expectations in the following formula exist. The expected deviance GL
of predictor $A$ to predict $Y$ is given by

$$\mathbb{E}_\theta\big[\mathfrak{d}(Y, A(\boldsymbol{Y}_n))\big] = \mathbb{E}_\theta\big[\mathfrak{d}(Y, \mu)\big] + \mathcal{E}\big(\mu, A(\boldsymbol{Y}_n)\big) \;\ge\; \mathbb{E}_\theta\big[\mathfrak{d}(Y, \mu)\big].$$

Remarks 4.8

• $\mathbb{E}_\theta[\mathfrak{d}(Y, \mu)]$ plays the role of the pure process variance (irreducible risk) of
Theorem 4.1. This term does not involve any parameter estimation bias and
uncertainty because it is based on the true parameter $\theta$ and $\mu = \kappa'(\theta)$,
respectively. In Sect. 4.1.3, below, we are going to justify the appropriateness
of this object as a tool for forecast evaluation. In particular, because the unit
deviance is strictly consistent for the mean functional, the true mean $\mu = \kappa'(\theta)$
minimizes $\mathbb{E}_\theta[\mathfrak{d}(Y, \mu)]$, see (4.28), below.

• The second term $\mathcal{E}(\mu, A(\boldsymbol{Y}_n))$ measures parameter estimation bias and
uncertainty of decision rule $A(\boldsymbol{Y}_n)$ versus the true parameter $\mu = \kappa'(\theta)$. The first
remark is that we can do this for any decision rule $A$, i.e., we do not necessarily
need to consider the MLE. The second remark is that we can no longer get a
clear-cut differentiation between a bias term and a parameter estimation uncertainty
term for deviance loss functions not coming from the Gaussian distribution. We
come back to this in Remarks 7.17, below, where we give more characterization
to the individual terms of the expected deviance GL.

• An issue in applying Theorem 4.7 to the MLE decision rule $A(\boldsymbol{Y}_n) = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$
is that, in general, it does not lead to a finite estimation risk function. For instance,
in the Poisson case we have with positive probability $\hat\mu^{\rm MLE}(\boldsymbol{Y}_n) = 0$, which
results in an infinite estimation risk. In order to avoid this, we need to bound
the decision rule away from the boundary of $\mathcal{M}$ and $\Theta$, respectively. In the
Poisson case this can be achieved by considering a decision rule $A(\boldsymbol{Y}_n) =
\max\{\hat\mu^{\rm MLE}(\boldsymbol{Y}_n), \epsilon\}$ for a fixed given $\epsilon \in (0, \mu = \kappa'(\theta))$. This decision rule
has a bias which asymptotically vanishes as $n \to \infty$. Moreover, consistency and
asymptotic normality tell us that this lower bound does not affect prediction for
large sample sizes $n$ (with large probability).

• Similar to (4.3), we can also consider the deviance GL, given $\boldsymbol{Y}_n$. Under
independence of $\boldsymbol{Y}_n$ and $Y$ we have deviance GL

$$\mathbb{E}_\theta\big[\mathfrak{d}(Y, A(\boldsymbol{Y}_n)) \,\big|\, \boldsymbol{Y}_n\big] = \mathbb{E}_\theta\big[\mathfrak{d}(Y, \mu)\big] + \mathfrak{d}\big(\mu, A(\boldsymbol{Y}_n)\big) \;\ge\; \mathbb{E}_\theta\big[\mathfrak{d}(Y, \mu)\big]. \tag{4.15}$$

Thus, here we directly compare $A(\boldsymbol{Y}_n)$ to the true parameter $\mu$.
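
The split of the expected deviance GL in Theorem 4.7 can be verified exactly in a small Bernoulli model, where all expectations are finite sums. This is only a sketch under illustrative assumptions: the mean, the sample size and the decision rule `predictor` (kept strictly inside $(0, 1)$ so that all expectations exist) are hypothetical choices.

```python
import math
from itertools import product

def xlog_ratio(x, z):
    """x * log(x / z) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x / z)

def bernoulli_dev(y, a):
    """Bernoulli unit deviance for y in [0, 1] and a in (0, 1)."""
    return 2.0 * (xlog_ratio(y, a) + xlog_ratio(1.0 - y, 1.0 - a))

mu, n = 0.3, 4

def predictor(ys):
    # hypothetical decision rule A(Y_n), bounded away from 0 and 1
    return (sum(ys) + 0.5) / (len(ys) + 1.0)

def prob(ys):
    """Probability of a tuple of independent Bernoulli(mu) outcomes."""
    return math.prod(mu if y == 1 else 1.0 - mu for y in ys)

samples = list(product([0, 1], repeat=n))

# Expected deviance GL: expectation over (Y_n, Y), using independence
expected_gl = sum(prob(s) * prob((y,)) * bernoulli_dev(y, predictor(s))
                  for s in samples for y in (0, 1))
# Irreducible part E[d(Y, mu)] and estimation risk E[d(mu, A(Y_n))]
irreducible = sum(prob((y,)) * bernoulli_dev(y, mu) for y in (0, 1))
estimation_risk = sum(prob(s) * bernoulli_dev(mu, predictor(s)) for s in samples)
```

The expected deviance GL equals the sum of the irreducible part and the strictly positive estimation risk, up to floating-point error.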


Example 4.9 (Estimation Risk Function in the Gaussian Case) We consider the
Gaussian case with cumulant function $\kappa(\theta) = \theta^2/2$ and canonical link $h(\mu) = \mu$.
The estimation risk function is in the Gaussian case for a square integrable predictor
$A(\boldsymbol{Y}_n)$ given by

$$\mathcal{E}\big(\mu, A(\boldsymbol{Y}_n)\big) = \mathbb{E}_\theta\big[\mathfrak{d}(\mu, A(\boldsymbol{Y}_n))\big]$$
$$= 2\,\Big(\mu h(\mu) - \kappa(h(\mu)) - \mu\,\mathbb{E}_\theta\big[h(A(\boldsymbol{Y}_n))\big] + \mathbb{E}_\theta\big[\kappa(h(A(\boldsymbol{Y}_n)))\big]\Big)$$
$$= \mu^2 - 2\mu\,\mathbb{E}_\theta\big[A(\boldsymbol{Y}_n)\big] + \mathbb{E}_\theta\big[(A(\boldsymbol{Y}_n))^2\big]$$
$$= \big(\mu - \mathbb{E}_\theta[A(\boldsymbol{Y}_n)]\big)^2 + \mathrm{Var}_\theta\big(A(\boldsymbol{Y}_n)\big).$$

These are exactly the squared bias and the estimation variance, see (4.1). Thus, in the
Gaussian case, the MSEP and the expected deviance GL coincide. Moreover, adding
a deterministic bias $c \in \mathbb{R}$ to $A(\boldsymbol{Y}_n)$ increases the estimation risk function, supposed
that $A(\boldsymbol{Y}_n)$ is unbiased for $\mu$. We emphasize the latter as this is an important
property to have, and we refer to the next Example 4.10 for an example where this
property fails to hold. ■


Example 4.10 (Estimation Risk Function in the Poisson Case) We consider the
Poisson case with cumulant function $\kappa(\theta) = e^\theta$ and canonical link $h(\mu) = \log\mu$.
The estimation risk function is given by (subject to existence)

$$\mathcal{E}\big(\mu, A(\boldsymbol{Y}_n)\big) = 2\,\Big(\mu\log(\mu) - \mu - \mu\,\mathbb{E}_\theta\big[\log(A(\boldsymbol{Y}_n))\big] + \mathbb{E}_\theta\big[A(\boldsymbol{Y}_n)\big]\Big). \tag{4.16}$$

Assume that decision rule $A(\boldsymbol{Y}_n)$ is non-deterministic and unbiased for $\mu$. Using
Jensen's inequality these assumptions imply for the estimation risk function

$$\mathcal{E}\big(\mu, A(\boldsymbol{Y}_n)\big) = 2\mu\,\Big(\log(\mu) - \mathbb{E}_\theta\big[\log(A(\boldsymbol{Y}_n))\big]\Big) > 0.$$

We now add a small deterministic bias $c \in \mathbb{R}$ to the unbiased estimator $A(\boldsymbol{Y}_n)$ for
$\mu$. This gives us estimation risk function, see (4.16) and subject to existence,

$$\mathcal{E}\big(\mu, A(\boldsymbol{Y}_n) + c\big) = 2\,\Big(\mu\log(\mu) - \mu\,\mathbb{E}_\theta\big[\log(A(\boldsymbol{Y}_n) + c)\big] + c\Big).$$

Consider the derivative w.r.t. the bias $c$ at $c = 0$; we use Jensen's inequality in the last step,

$$\frac{\partial}{\partial c}\,\mathcal{E}\big(\mu, A(\boldsymbol{Y}_n) + c\big)\Big|_{c=0} = 2\,\left(-\mu\,\mathbb{E}_\theta\left[\frac{1}{A(\boldsymbol{Y}_n)}\right] + 1\right) < 2\,\left(-\mu\,\frac{1}{\mathbb{E}_\theta[A(\boldsymbol{Y}_n)]} + 1\right) = 0. \tag{4.17}$$

Thus, the estimation risk becomes smaller if we add a small bias to the
(non-deterministic) unbiased predictor $A(\boldsymbol{Y}_n)$. This issue has been raised in Denuit et
al. [97]. Of course, this is a very unfavorable property, and it is rather different from
the Gaussian case in Example 4.9. It is essentially driven by the fact that parameter
estimation is based on a finite sample, which implies a strict inequality in (4.17)
for the finite sample estimate $A(\boldsymbol{Y}_n)$. A conclusion of this example is that if we use
expected deviance GLs for forecast evaluation we need to insist on having unbiased
predictors. This will become especially important for more complex regression
models, see Sect. 7.4.2, below.

More generally, one can prove this result of a smaller estimation risk function for
a small positive bias for any EDF member with power variance function $V(\mu) = \mu^p$
with $p \ge 1$, see also (4.18) below. The proof uses the Fortuin–Kasteleyn–Ginibre
(FKG) inequality [133] providing $\mathbb{E}_\theta[A(\boldsymbol{Y}_n)^{1-p}] < \mathbb{E}_\theta[A(\boldsymbol{Y}_n)]\,\mathbb{E}_\theta[A(\boldsymbol{Y}_n)^{-p}] =
\mu\,\mathbb{E}_\theta[A(\boldsymbol{Y}_n)^{-p}]$ to receive (4.17) for power variance parameters $p \ge 1$. ■
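
The effect described in Examples 4.9 and 4.10 can be reproduced with a toy predictor. In the sketch below, which is an illustration under assumed values only, the unbiased non-deterministic predictor $A(\boldsymbol{Y}_n)$ is replaced by a hypothetical two-point distribution with mean $\mu$; the Poisson estimation risk follows (4.16) and the Gaussian one follows Example 4.9.

```python
import math

mu = 0.1
values = [0.05, 0.15]      # two equally likely values of A(Y_n); mean equals mu
probs = [0.5, 0.5]

def poisson_risk(c):
    """Poisson estimation risk of A + c, following (4.16)."""
    e_log = sum(q * math.log(a + c) for a, q in zip(values, probs))
    e_a = sum(q * (a + c) for a, q in zip(values, probs))
    return 2.0 * (mu * math.log(mu) - mu - mu * e_log + e_a)

def gaussian_risk(c):
    """Gaussian estimation risk of A + c: squared bias plus variance."""
    return sum(q * (mu - a - c) ** 2 for a, q in zip(values, probs))

# Derivative of the Poisson estimation risk w.r.t. c at c = 0, as in (4.17)
slope_at_zero = 2.0 * (1.0 - mu * sum(q / a for a, q in zip(values, probs)))

c = 0.005                  # small positive deterministic bias
poisson_gain = poisson_risk(0.0) - poisson_risk(c)
gaussian_gain = gaussian_risk(0.0) - gaussian_risk(c)
```

In this toy setting the Poisson estimation risk decreases when the small positive bias is added, whereas the Gaussian estimation risk increases, which is the asymmetry emphasized in the two examples.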


Remarks 4.11 (Conclusion from Examples 4.9 and 4.10 and a Further Remark)

• Working with expected deviance GLs for evaluating forecasts requires some care
because a bigger bias in the (finite sample) estimate $A(\boldsymbol{Y}_n)$ may provide a smaller
estimation risk function $\mathcal{E}(\mu, A(\boldsymbol{Y}_n))$. For this reason, we typically insist on
having unbiased predictors/forecasts. The latter is also an important requirement
in financial applications to guarantee that the overall price is set to the right level;
we refer to the balance property in Corollary 3.19 and to Sect. 7.4.2, below.

• In Theorems 4.1 and 4.7 we use independence between the predictor $A(\boldsymbol{Y}_n)$
and the random variable $Y$ to receive the split of the expected deviance GL
into irreducible risk and estimation risk function. In regression models, this
independence between the predictor $A(\boldsymbol{Y}_n)$ and the random variable $Y$ may
no longer hold. In that case we will still work with the expected deviance GL
$\mathbb{E}_\theta[\mathfrak{d}(Y, A(\boldsymbol{Y}_n))]$, but a clear split between estimation and forecasting will no
longer be possible, see Sect. 4.2, below.


The next example gives the most important unit deviances in actuarial modeling. 


Example 4.12 (Unit Deviances) We give the most prominent examples of unit 
deviances within the single-parameter linear EDF. We recall unit deviance (2.25) 


$$\mathfrak{d}(y, \mu) = 2 \left( y h(y) - \kappa(h(y)) - y h(\mu) + \kappa(h(\mu)) \right) \ge 0.$$


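In Sect. 2.2 we have met the examples given in Table 4.1.

Table 4.1 Unit deviances of selected distributions commonly used in actuarial science

Distribution         Cumulant function $\kappa(\theta)$                                   Unit deviance $\mathfrak{d}(y, \mu)$

Gaussian             $\theta^2/2$                                                         $(y - \mu)^2$

Gamma                $-\log(-\theta)$                                                     $2\left((y - \mu)/\mu + \log(\mu/y)\right)$

Inverse Gaussian     $-\sqrt{-2\theta}$                                                   $(y - \mu)^2/(\mu^2 y)$

Poisson              $e^\theta$                                                           $2\left(\mu - y - y\log(\mu/y)\right)$

Negative-binomial    $-\log(1 - e^\theta)$                                                $2\left(y\log\left(\frac{y}{\mu}\right) - (y+1)\log\left(\frac{y+1}{\mu+1}\right)\right)$

Tweedie’s CP         $\frac{((1-p)\theta)^{\frac{2-p}{1-p}}}{2-p}$, $p \in (1,2)$        $2\left(y\,\frac{y^{1-p} - \mu^{1-p}}{1-p} - \frac{y^{2-p} - \mu^{2-p}}{2-p}\right)$

Bernoulli            $\log(1 + e^\theta)$                                                 $2\left(-y\log\mu - (1-y)\log(1-\mu)\right)$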
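The entries of Table 4.1 can be checked numerically. The following is a minimal sketch, assuming strictly positive observations for the gamma, inverse Gaussian and Tweedie cases; the function names are ours and purely illustrative. It also illustrates that the Tweedie expression approaches the Poisson and gamma unit deviances when $p$ is chosen close to 1 and 2, respectively.

```python
import math

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def dev_gaussian(y, mu):
    return (y - mu) ** 2

def dev_gamma(y, mu):               # requires y > 0 and mu > 0
    return 2.0 * ((y - mu) / mu + math.log(mu / y))

def dev_inverse_gaussian(y, mu):    # requires y > 0 and mu > 0
    return (y - mu) ** 2 / (mu ** 2 * y)

def dev_poisson(y, mu):             # y >= 0, mu > 0; the case y = 0 gives 2 * mu
    return 2.0 * (mu - y + xlogy(y, y) - xlogy(y, mu))

def dev_tweedie(y, mu, p):
    """Power variance form of Table 4.1; p must differ from 1 and 2, and y > 0 in this sketch."""
    first = y * (y ** (1 - p) - mu ** (1 - p)) / (1 - p)
    second = (y ** (2 - p) - mu ** (2 - p)) / (2 - p)
    return 2.0 * (first - second)

y, mu = 2.0, 3.0
print(dev_gaussian(y, mu), dev_tweedie(y, mu, 0))        # p = 0 is the Gaussian case
print(dev_poisson(y, mu), dev_tweedie(y, mu, 1.0001))    # p close to 1
print(dev_gamma(y, mu), dev_tweedie(y, mu, 1.9999))      # p close to 2
```

All unit deviances vanish for $y = \mu$ and are positive otherwise, which is the calibration property used throughout this section.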


If we focus on Tweedie’s distributions having power variance functions $V(\mu) = \mu^p$,
see Table 2.1, we get a unified expression for the unit deviances for
$p \in \{0\} \cup (1, 2) \cup (2, \infty)$

$$\mathfrak{d}_p(y, \mu) = 2\left( y\,\frac{y^{1-p} - \mu^{1-p}}{1-p} - \frac{y^{2-p} - \mu^{2-p}}{2-p} \right)
= 2\left( \frac{y^{2-p}}{(1-p)(2-p)} - \frac{y\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p} \right). \tag{4.18}$$

For the remaining power variance cases we have: $p = 1$ corresponds to the Poisson
case, $p = 2$ gives the gamma case, the cases $p < 0$ do not have a steep cumulant
function, and, moreover, there are no EDF models for $p \in (0, 1)$, see Theorem 2.18.

The unit deviance in the Bernoulli case is also called binary cross-entropy.
This binary cross-entropy has a categorical generalization, called multi-class
cross-entropy. Assume we have a categorical EF with levels $\{1, \ldots, k+1\}$ and
corresponding probabilities $p_1, \ldots, p_{k+1} \in (0,1)$ summing up to 1, see Sect. 2.1.4.
We denote by $\boldsymbol{Y} = (\mathbb{1}_{\{Y=1\}}, \ldots, \mathbb{1}_{\{Y=k+1\}})^\top \in \mathbb{R}^{k+1}$ the indicator variable that
shows which level the categorical random variable $Y$ takes; $\boldsymbol{Y}$ is called one-hot
encoding of the categorical random variable $Y$. Assume $\boldsymbol{y}$ is a realization of $\boldsymbol{Y}$ and
set $\boldsymbol{\mu} = \boldsymbol{p} = (p_1, \ldots, p_{k+1})^\top$. The categorical (multi-class) cross-entropy loss
function is given by

$$\mathfrak{d}(\boldsymbol{y}, \boldsymbol{\mu}) = \mathfrak{d}(\boldsymbol{y}, \boldsymbol{p}) = -2 \sum_{j=1}^{k+1} y_j \log p_j \ge 0. \tag{4.19}$$


This cross-entropy is closely related to the KL divergence between two categorical
distributions $\boldsymbol{p}$ and $\boldsymbol{q}$ on $\{1, \ldots, k+1\}$. The KL divergence from $\boldsymbol{p}$ to $\boldsymbol{q}$ is given by

$$D_{\rm KL}(\boldsymbol{q} \| \boldsymbol{p}) = \sum_{j=1}^{k+1} q_j \log\left(\frac{q_j}{p_j}\right) = \sum_{j=1}^{k+1} q_j \log q_j - \sum_{j=1}^{k+1} q_j \log p_j.$$

If we replace the true (but unknown) distribution $\boldsymbol{q}$ by observation $\boldsymbol{Y} = \boldsymbol{y}$ we receive
unit deviance (4.19) (scaled by 2), and the MLE is obtained by minimizing this KL
divergence, see also Example 3.10.
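The relation between the cross-entropy (4.19) and the KL divergence can be verified on a small example. This is a minimal sketch with made-up probabilities and our own function names; it uses the convention $0 \log 0 = 0$, so that the term $\sum_j q_j \log q_j$ vanishes for a one-hot vector.

```python
import math

def cross_entropy_deviance(y, p):
    """Unit deviance (4.19): -2 * sum_j y_j * log(p_j) for a one-hot vector y."""
    return -2.0 * sum(y_j * math.log(p_j) for y_j, p_j in zip(y, p) if y_j > 0)

def kl_divergence(q, p):
    """D_KL(q || p) = sum_j q_j * log(q_j / p_j), with 0 * log(0) = 0."""
    return sum(q_j * math.log(q_j / p_j) for q_j, p_j in zip(q, p) if q_j > 0)

y = [0, 1, 0]            # one-hot encoding of level 2 (three levels, illustrative)
p = [0.2, 0.5, 0.3]      # forecast probabilities (illustrative)

dev = cross_entropy_deviance(y, p)
kl = kl_divergence(y, p)
print(dev, 2.0 * kl)     # the unit deviance is twice the KL divergence
```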


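Outlook 4.13 In the regression modeling, below, each response $Y_i$ will have its own
mean parameter $\mu_i = \mu(\boldsymbol{\beta}, \boldsymbol{x}_i)$ which will be a function of its covariate information
$\boldsymbol{x}_i$, and $\boldsymbol{\beta}$ denotes a regression parameter to be estimated with MLE. In that case,
we modify the deviance loss function (4.9) to

$$\boldsymbol{\beta} \mapsto \mathfrak{D}(\boldsymbol{Y}_n, \boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i) = \frac{1}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}\left(Y_i, \mu(\boldsymbol{\beta}, \boldsymbol{x}_i)\right), \tag{4.20}$$

and the MLE of $\boldsymbol{\beta}$ can be found by solving

$$\hat{\boldsymbol{\beta}}^{\rm MLE} = \underset{\boldsymbol{\beta}}{\arg\min}\ \mathfrak{D}(\boldsymbol{Y}_n, \boldsymbol{\beta}). \tag{4.21}$$

If $Y$ is a new response with covariate information $\boldsymbol{x}$ and following the same EDF as
$\boldsymbol{Y}_n$, we will evaluate the corresponding expected scaled deviance GL given by

$$\mathbb{E}_{\boldsymbol{\beta}}\left[ \frac{v}{\varphi}\, \mathfrak{d}\left(Y, \mu(\hat{\boldsymbol{\beta}}^{\rm MLE}, \boldsymbol{x})\right) \right], \tag{4.22}$$

where $\mathbb{E}_{\boldsymbol{\beta}}$ is the expectation under the true regression parameter $\boldsymbol{\beta}$ for $\boldsymbol{Y}_n$ and $Y$.
This will be discussed in Sect. 5.1.7, below. If we interpret $(Y, \boldsymbol{x}, v)$ as a random
vector describing a randomly selected insurance policy from our portfolio, and being
independent of $\boldsymbol{Y}_n$ (and the corresponding covariate information $\boldsymbol{x}_i$, $1 \le i \le n$),
then $\hat{\boldsymbol{\beta}}^{\rm MLE}$ will be independent of $(Y, \boldsymbol{x}, v)$. Nevertheless, the predictor $\mu(\hat{\boldsymbol{\beta}}^{\rm MLE}, \boldsymbol{x})$
will introduce dependence between the chosen decision rule and $Y$ through $\boldsymbol{x}$, and
we no longer receive the split of the expected deviance GL as stated in Theorem 4.7;
for a related discussion we also refer to Remarks 7.17, below.

If we interpret $(Y, \boldsymbol{x}, v)$ as a randomly selected insurance policy, then the
expected GL (4.22) is evaluated under the joint (portfolio) distribution of $(Y, \boldsymbol{x}, v)$,
and the deviance loss $\mathfrak{D}(\boldsymbol{Y}_n, \hat{\boldsymbol{\beta}}^{\rm MLE})$ is an (in-sample) empirical version of (4.22).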
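To illustrate the regression deviance loss (4.20) and its minimization (4.21), consider the following minimal sketch. It assumes a Poisson model with $\varphi = 1$, a hypothetical log-link regression function $\mu(\boldsymbol{\beta}, x) = \exp(\beta_0 + \beta_1 x)$ with a single binary feature, and made-up toy data; none of these choices come from the text. In this special case the minimizer of (4.20) is available in closed form through the exposure-weighted group means, which avoids a numerical optimizer.

```python
import math

def poisson_unit_deviance(y, mu):
    ylog = 0.0 if y == 0 else y * math.log(y / mu)
    return 2.0 * (mu - y + ylog)

def mean_function(beta, x):
    """Hypothetical regression function mu(beta, x) = exp(beta0 + beta1 * x)."""
    return math.exp(beta[0] + beta[1] * x)

def deviance_loss(beta, data, phi=1.0):
    """Deviance loss (4.20): average of (v_i / phi) * d(Y_i, mu(beta, x_i))."""
    return sum(v / phi * poisson_unit_deviance(y, mean_function(beta, x))
               for y, x, v in data) / len(data)

def fit_binary_feature(data):
    """Closed-form minimizer of (4.20) for one binary feature: weighted group means."""
    mu = {}
    for g in (0, 1):
        num = sum(v * y for y, x, v in data if x == g)
        den = sum(v for y, x, v in data if x == g)
        mu[g] = num / den
    return (math.log(mu[0]), math.log(mu[1] / mu[0]))

# toy data (Y_i, x_i, v_i): claim frequency, binary feature, exposure (made up)
data = [(0.0, 0, 1.0), (2.0, 0, 0.5), (0.0, 0, 1.0),
        (1.0, 1, 1.0), (4.0, 1, 0.5), (0.0, 1, 0.5)]

beta_hat = fit_binary_feature(data)
print(beta_hat, deviance_loss(beta_hat, data))
```

For general feature spaces no such closed form exists, and (4.21) is solved numerically.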


4.1.3 A Decision-Theoretic Approach to Forecast Evaluation 


We present an excursion to a decision-theoretic approach to forecast evaluation. 
This excursion gives the theoretical foundation to the unit deviance considerations 
from above. This section follows Gneiting [162], Krüger–Ziegel [227] and Denuit
et al. [97], and we refrain from giving complete proofs in this section. Forecast
evaluation should involve consistent loss/scoring functions and proper scoring rules
to encourage the forecaster to make careful assessments and honest forecasts.
Consistent loss functions are also a necessary tool to receive consistency of
M-estimators, we refer to Remarks 3.26.


Consistency and Proper Scoring Rules 


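Denote by $\mathfrak{C} \subseteq \mathbb{R}$ the convex closure of the support of a real-valued random variable
$Y$, and let the action space be $\mathbb{A} = \mathfrak{C}$, see also (3.1). Predictions are evaluated in
terms of a loss/scoring function

$$L: \mathfrak{C} \times \mathbb{A} \to \mathbb{R}_+, \qquad (y, a) \mapsto L(y, a) \ge 0. \tag{4.23}$$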


Remark 4.14 In (4.23) we assume that the loss function L is bounded below by 
zero. This can be an advantage in applications because it gives a calibration to the 
loss function. In general, this lower bound is not a necessary condition for forecast 
evaluation. If we drop this lower bound property, we rather call L (only) a scoring 
function. For instance, the log-likelihood $\log(f(y, a))$ in (3.27) plays the role of a
scoring function. 


The forecaster can take the position of minimizing the expected loss to choose
her/his action rule. That is, subject to existence, an optimal action w.r.t. $L$ is received
by

$$\hat{a} = \hat{a}(F) = \underset{a \in \mathbb{A}}{\arg\min}\ \mathbb{E}_F\left[L(Y, a)\right] = \underset{a \in \mathbb{A}}{\arg\min} \int_{\mathfrak{C}} L(y, a)\, dF(y). \tag{4.24}$$

In this setup the scoring function $L(y, a)$ describes the loss that the forecaster suffers
if she/he uses action $a \in \mathbb{A}$ and observation $y \in \mathfrak{C}$ materializes. Since we do not
want to insist on uniqueness in (4.24) we rather think of set-valued functionals in
this section, which may provide solutions to problems like (4.24).¹

We now reverse the line of arguments, and we start from a general set-valued
functional. Denote by $\mathcal{F}$ the family of distribution functions of interest supported
on $\mathfrak{C}$. Consider the set-valued functional

$$\mathfrak{A}: \mathcal{F} \to \mathcal{P}(\mathbb{A}), \qquad F \mapsto \mathfrak{A}(F) \subset \mathbb{A}, \tag{4.25}$$

that maps each distribution $F \in \mathcal{F}$ to a subset $\mathfrak{A}(F)$ of the action space $\mathbb{A} = \mathfrak{C}$,
that is, an element of the power set $\mathcal{P}(\mathbb{A})$. The main question that we want to study
in this section is the following: can we find a loss function $L$ so that the set-valued
functional $\mathfrak{A}$ is obtained by a loss minimization (4.24)? This motivates the following
definition.

¹ In fact, also for the MLE in Definition 3.4 we should consider a set-valued functional. We have
decided to skip this distinction to avoid any kind of complication and to not disturb the flow of
reading.


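Definition 4.15 (Strict Consistency) The loss function $L: \mathfrak{C} \times \mathbb{A} \to \mathbb{R}_+$ is
consistent for the functional $\mathfrak{A}: \mathcal{F} \to \mathcal{P}(\mathbb{A})$ relative to the class $\mathcal{F}$ if

$$\mathbb{E}_F\left[L(Y, \hat{a})\right] \le \mathbb{E}_F\left[L(Y, a)\right], \tag{4.26}$$

for all $F \in \mathcal{F}$, $\hat{a} \in \mathfrak{A}(F)$ and $a \in \mathbb{A}$. It is strictly consistent if it is consistent and
equality in (4.26) implies that $a \in \mathfrak{A}(F)$.


As stated in Theorem 1 of Gneiting [162], a loss function $L$ is consistent for the
functional $\mathfrak{A}$ relative to the class $\mathcal{F}$ if and only if, given any $F \in \mathcal{F}$, every $\hat{a} \in \mathfrak{A}(F)$
is an optimal action under $L$ in the sense of (4.24).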

We give an example. Assume we start from the functional $F \mapsto \mathfrak{A}(F) = \mathbb{E}_F[Y]$
that maps each distribution $F$ to its expected value. In this case we do not need
to consider a set-valued functional because the expected value is a singleton (we
assume that $\mathcal{F}$ only contains distributions with a finite first moment). The question
then is whether we can find a loss function $L$ such that this mean can be received by
a minimization (4.24). This question is answered in Theorem 4.19, below.

Next we relate a consistent loss function $L$ to a proper scoring rule. A proper
scoring rule is a function $R: \mathfrak{C} \times \mathcal{F} \to \mathbb{R}$ such that

$$\mathbb{E}_F\left[R(Y, F)\right] \le \mathbb{E}_F\left[R(Y, G)\right], \tag{4.27}$$

for all $F, G \in \mathcal{F}$, supposed that the expectations are well-defined. A scoring rule
$R$ analyzes the penalty $R(y, G)$ if the forecaster works with a distribution $G$ and
an observation $y$ of $Y \sim F$ materializes. Proper scoring rules have been promoted
in Gneiting–Raftery [163] and Gneiting [162]. They are important because they
encourage the forecaster to make honest forecasts, i.e., they give the forecaster the
incentive to minimize the expected score by following her/his true belief about the true
distribution, because only this minimizes the expected penalty in (4.27).
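Property (4.27) can be made concrete with the logarithmic score $R(y, G) = -\log g(y)$, which is a well-known proper scoring rule; this particular choice and the Bernoulli setting are our illustration and are not taken from the text. The sketch below evaluates the expected score of a Bernoulli forecast $G$ with success probability $q$ when the true distribution $F$ has success probability $p$, and searches a grid for the forecast with the smallest expected penalty.

```python
import math

def log_score(y, q):
    """Penalty R(y, G) = -log g(y) for a Bernoulli forecast G with success probability q."""
    return -math.log(q if y == 1 else 1.0 - q)

def expected_score(p, q):
    """E_F[R(Y, G)] for Y ~ F = Bernoulli(p) and forecast G = Bernoulli(q)."""
    return p * log_score(1, q) + (1.0 - p) * log_score(0, q)

p_true = 0.3                                  # illustrative true success probability
grid = [i / 100 for i in range(1, 100)]       # candidate forecasts q
best_q = min(grid, key=lambda q: expected_score(p_true, q))
print(best_q)                                 # the honest forecast q = p minimizes the expected score
```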


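Theorem 4.16 (Gneiting [162, Theorem 3]) Assume that $L$ is a consistent loss
function for the functional $\mathfrak{A}$ relative to the class $\mathcal{F}$. For each $F \in \mathcal{F}$, let $a_F \in \mathfrak{A}(F)$.
The scoring rule

$$R: \mathfrak{C} \times \mathcal{F} \to \mathbb{R}, \qquad (y, F) \mapsto R(y, F) = L(y, a_F),$$

is a proper scoring rule.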


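Example 4.17 Consider the unit deviance $\mathfrak{d}(\cdot, \cdot): \mathfrak{C} \times \mathcal{M} \to \mathbb{R}_+$ for a given EDF
$\mathcal{F} = \{F(\cdot; \theta, v/\varphi);\ \theta \in \Theta\}$ with cumulant function $\kappa$. Lemma 2.22 says that under
suitable assumptions this unit deviance $\mathfrak{d}(y, \mu)$ is zero if and only if $y = \mu$. We
consider the mean functional on $\mathcal{F}$

$$\mathfrak{A}: \mathcal{F} \to \mathbb{A} = \mathcal{M}, \qquad F_\theta = F(\cdot; \theta, v/\varphi) \mapsto \mathfrak{A}(F_\theta) = \mu(\theta),$$

where $\mu = \mu(\theta) = \kappa'(\theta)$ is the mean of the chosen EDF. Choosing the unit deviance
as loss function we receive for any action $a \in \mathbb{A}$, see (4.13),

$$\begin{aligned}
\mathbb{E}_\theta\left[\mathfrak{d}(Y, a)\right] &= \mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu)\right] + 2\, \mathbb{E}_\theta\left[Y h(\mu) - \kappa(h(\mu)) - Y h(a) + \kappa(h(a))\right] \\
&= \mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu)\right] + 2 \left(\mu h(\mu) - \kappa(h(\mu)) - \mu h(a) + \kappa(h(a))\right) \\
&= \mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu)\right] + \mathfrak{d}(\mu, a).
\end{aligned}$$

This is minimized for $a = \mu$ and it proves that the unit deviance is strictly consistent
for the mean functional $\mathfrak{A}: F_\theta \mapsto \mathfrak{A}(F_\theta) = \mu(\theta)$ relative to the chosen EDF
$\mathcal{F} = \{F(\cdot; \theta, v/\varphi);\ \theta \in \Theta\}$. Using Theorem 4.16, the scoring rule

$$R: \mathfrak{C} \times \mathcal{F} \to \mathbb{R}, \qquad (y, F_\theta) \mapsto R(y, F_\theta) = \mathfrak{d}(y, \mu(\theta)),$$

is a strictly proper scoring rule, that is,

$$\mathbb{E}_\theta\left[R(Y, F_\theta)\right] = \mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu(\theta))\right] < \mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu(\tilde\theta))\right] = \mathbb{E}_\theta\left[R(Y, F_{\tilde\theta})\right],$$

for any $\tilde\theta \ne \theta$. We conclude from this small example that the unit deviance is a
strictly consistent loss function for the mean functional on the chosen EDF, and this
provides us with a strictly proper scoring rule.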
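The decomposition derived in Example 4.17 can be checked numerically. The following minimal sketch assumes a Poisson model for claim counts with $v/\varphi = 1$ and truncates the series for the expectation at a large index; both choices are ours. It compares $\mathbb{E}_\theta[\mathfrak{d}(Y, a)]$ with $\mathbb{E}_\theta[\mathfrak{d}(Y, \mu)] + \mathfrak{d}(\mu, a)$ for several actions $a$.

```python
import math

def poisson_unit_deviance(y, mu):
    ylog = 0.0 if y == 0 else y * math.log(y / mu)
    return 2.0 * (mu - y + ylog)

def poisson_pmf(k, mu):
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def expected_deviance(mu, a, kmax=200):
    """E[d(Y, a)] for Y ~ Poisson(mu); the series is truncated at kmax (an approximation)."""
    return sum(poisson_pmf(k, mu) * poisson_unit_deviance(k, a) for k in range(kmax + 1))

mu = 2.0
gaps = []
for a in (0.5, 1.0, 2.0, 3.5):
    lhs = expected_deviance(mu, a)
    rhs = expected_deviance(mu, mu) + poisson_unit_deviance(mu, a)
    gaps.append(abs(lhs - rhs))
    print(a, lhs, rhs)
```

Since $\mathfrak{d}(\mu, a) \ge 0$ with equality only for $a = \mu$, the printed expected deviances are smallest for $a = \mu = 2$.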


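In the above Example 4.17 we have chosen the mean functional

$$\mathfrak{A}: \mathcal{F} \to \mathbb{A} = \mathcal{M}, \qquad F_\theta = F(\cdot; \theta, v/\varphi) \mapsto \mathfrak{A}(F_\theta) = \mu(\theta),$$

within a given EDF $\mathcal{F} = \{F(\cdot; \theta, v/\varphi);\ \theta \in \Theta\}$. We have seen that

• the unit deviance $\mathfrak{d}(\cdot, \cdot)$ is a strictly consistent loss function for the mean
  functional $\mathfrak{A}$ relative to the EDF $\mathcal{F}$;

• the function $(y, F_\theta) \mapsto R(y, F_\theta) = \mathfrak{d}(y, \mu(\theta))$ is a strictly proper scoring
  rule for the EDF $\mathcal{F}$, i.e.,

$$\mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu(\theta))\right] < \mathbb{E}_\theta\left[\mathfrak{d}(Y, \mu(\tilde\theta))\right],$$

  for any $\tilde\theta \ne \theta$.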


The consideration of the mean functional $F \mapsto \mathfrak{A}(F) = \mathbb{E}_F[Y]$ in Example 4.17
is motivated by the fact that we typically forecast random variables by their means.
However, more generally, we may ask the question for which functionals $\mathfrak{A}: \mathcal{F} \to
\mathcal{P}(\mathbb{A})$, relative to a given set of distributions $\mathcal{F}$, there exists a loss function $L$ that is
strictly consistent.

Definition 4.18 (Elicitable) The functional $\mathfrak{A}$ is elicitable relative to a given set of
distributions $\mathcal{F}$ if there exists a loss function $L$ that is strictly consistent for $\mathfrak{A}$ and
$\mathcal{F}$.


Above we have seen that the mean functional is elicitable relative to the EDF 
using the unit deviance loss; expected values relative to F with finite second 
moments are also elicitable using the square loss function. Savage [327] more 
generally identifies the Bregman divergences as being the only consistent scoring 
functions for the mean functional; recall that the unit deviance is a special case of a 
Bregman divergence, see (2.29). We are going to state the corresponding result. 

For a general loss function L we make the following (standard) assumptions: 


(L0) $L(y, a) \ge 0$ and we have an equality if and only if $y = a$;

(L1) $L(y, a)$ is measurable in $y$ and continuous in $a$;

(L2) the partial derivative $\partial L(y, a)/\partial a$ exists and is continuous in $a$ whenever
$a \ne y$.


This then allows us to cite the following theorem. 


Theorem 4.19 (Gneiting [162, Theorem 7]) Let $\mathcal{F}$ be the class of distributions on
an interval $\mathfrak{C} \subseteq \mathbb{R}$ having finite first moments.

• Assume the loss function $L: \mathfrak{C} \times \mathbb{A} \to \mathbb{R}_+$ satisfies (L0)–(L2) for interval $\mathfrak{C} = \mathbb{A} \subseteq
  \mathbb{R}$. $L$ is consistent for the mean functional relative to the class $\mathcal{F}$ of compactly
  supported distributions on $\mathfrak{C}$ if and only if the loss function $L$ is of Bregman
  divergence form

$$D_\psi(y, a) = \psi(y) - \psi(a) - \psi'(a)(y - a),$$

  for a convex function $\psi$ with (sub-)gradient $\psi'$ on $\mathfrak{C}$.

• If $\psi$ is strictly convex on $\mathfrak{C}$, then the Bregman divergence $D_\psi$ is strictly consistent
  for the mean functional relative to the class $\mathcal{F}$ on $\mathfrak{C}$ for which both $\mathbb{E}_F[Y]$ and
  $\mathbb{E}_F[\psi(Y)]$ exist and are finite.


Theorem 4.19 tells us that Bregman divergences are the only consistent loss 
functions for the mean functional (under some additional assumptions). Consider 
the specific choice $\psi(a) = a^2/2$ which is a strictly convex function. For this choice,
the Bregman divergence is the square loss function $D_\psi(y, a) = (y - a)^2/2$, which
is strictly consistent for the mean functional relative to the class $\mathcal{F} \subset L^2(\mathbb{P})$. We
remark that also quantiles are elicitable, the corresponding result is going to be 
stated in Theorem 5.33, below. 
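The statement of Theorem 4.19 can be illustrated on a small discrete distribution. The sketch below is a minimal illustration with a made-up distribution and three strictly convex choices of $\psi$: the square function, $\psi(a) = a \log a - a$ (which gives half the Poisson unit deviance) and $\psi(a) = -\log a$ (which gives half the gamma unit deviance). For each choice, a grid search over actions locates the minimizer of the expected Bregman divergence.

```python
import math

def bregman(psi, dpsi, y, a):
    """Bregman divergence D_psi(y, a) = psi(y) - psi(a) - psi'(a) * (y - a)."""
    return psi(y) - psi(a) - dpsi(a) * (y - a)

# strictly convex functions psi together with their derivatives
choices = {
    "square": (lambda a: a * a / 2.0, lambda a: a),
    "poisson": (lambda a: a * math.log(a) - a, lambda a: math.log(a)),
    "gamma": (lambda a: -math.log(a), lambda a: -1.0 / a),
}

# discrete distribution F on a positive support (made up); its mean is 1.5
support = [0.5, 1.0, 2.0, 6.0]
probs = [0.4, 0.3, 0.2, 0.1]

def expected_loss(psi, dpsi, a):
    return sum(p * bregman(psi, dpsi, y, a) for y, p in zip(support, probs))

grid = [0.5 + i / 100 for i in range(551)]      # candidate actions in [0.5, 6.0]
argmins = {name: min(grid, key=lambda a: expected_loss(psi, dpsi, a))
           for name, (psi, dpsi) in choices.items()}
print(argmins)                                  # every choice is minimized at the mean 1.5
```

A grid search only locates the minimizer up to the grid resolution; here the mean lies on the grid.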

The second bullet point of Theorem 4.19 immediately implies that the unit
deviance $\mathfrak{d}(\cdot, \cdot)$ is a strictly consistent loss function for the mean functional within
the chosen EDF, see also (2.29) and Example 4.17. In particular, for $\theta \in \Theta$

$$\mu = \mu(\theta) = \underset{a \in \mathcal{M}}{\arg\min}\ \mathbb{E}_\theta\left[\mathfrak{d}(Y, a)\right]. \tag{4.28}$$

Explicit evaluation of (4.28) requires that the true distribution $F_\theta$ of $Y$ is known.
Since, typically, this is not the case, we need to evaluate it empirically. Assume
that the random variables $Y_i$ are independent and $F_\theta$ distributed, with $F_\theta$ belonging
to the fixed EDF providing the corresponding unit deviance $\mathfrak{d}$. Then, the objective
function in (4.28) is approximated by, a.s.,

$$\frac{1}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, a) \ \to\ \mathbb{E}_\theta\left[\frac{v}{\varphi}\, \mathfrak{d}(Y, a)\right] \qquad \text{as } n \to \infty. \tag{4.29}$$

The convergence statement follows from the strong law of large numbers applied
to the i.i.d. random variables $(Y_i, v_i)$, $i \ge 1$, and supposed that the right-hand side
of (4.29) exists. Thus, the deviance loss function (4.9) is an empirical version of the
expected deviance loss function, and this approach is successful if we can exchange
the ‘argmin’ operator of (4.28) and the limit $n \to \infty$ in (4.29). This closes the circle
and brings us back to the M-estimator considered in Remarks 3.26 and 3.29, and
which also links forecast evaluation and M-estimation.
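The empirical counterpart in (4.29) can be illustrated by simulation. The following minimal sketch assumes a gamma EDF with $\varphi = 1$, mean $\mu = 3$ and random exposures $v_i$; all numerical values and names are illustrative. The empirical objective function is minimized by the exposure-weighted sample mean, which is close to the true mean for large $n$.

```python
import math
import random

def gamma_unit_deviance(y, mu):
    return 2.0 * ((y - mu) / mu + math.log(mu / y))

def empirical_deviance(a, ys, vs, phi=1.0):
    """Left-hand side of (4.29): (1/n) * sum_i (v_i / phi) * d(Y_i, a)."""
    return sum(v / phi * gamma_unit_deviance(y, a) for y, v in zip(ys, vs)) / len(ys)

rng = random.Random(2024)
mu_true, phi, n = 3.0, 1.0, 20000
vs = [rng.uniform(1.0, 3.0) for _ in range(n)]
# gamma with shape v/phi and scale mu*phi/v has mean mu and variance phi*mu^2/v
ys = [rng.gammavariate(v / phi, mu_true * phi / v) for v in vs]

# minimizer of the empirical objective: exposure-weighted sample mean
a_hat = sum(v * y for y, v in zip(ys, vs)) / sum(vs)
print(a_hat, empirical_deviance(a_hat, ys, vs))
```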


Forecast Dominance 


A consequence of Theorem 4.19 is that there are infinitely many strictly consistent
loss functions for the mean functional, and, in principle, we could choose any
of these for forecast evaluation. Choosing the unit deviance $\mathfrak{d}$ that matches the
distribution $F_\theta$ of the observations $\boldsymbol{Y}_n$ and $Y$, respectively, gives us the MLE $\hat\mu^{\rm MLE}$,
and we have seen that the MLE $\hat\mu^{\rm MLE}$ is not only unbiased for $\mu = \kappa'(\theta)$, but it
also meets the Cramér–Rao information bound. That is, it is UMVU within the data
generating model reflected by the true unit deviance $\mathfrak{d}$. This provides us (in the finite
sample case) with a natural candidate for $\mathfrak{d}$ in (4.29) and, thus, a canonical proper
scoring rule for (out-of-sample) forecast evaluation.

The previous statements have all been done under the assumption that there is
no uncertainty about the underlying family of distribution functions that generates
$Y$ and $\boldsymbol{Y}_n$, respectively. Uncertainty was limited to the true canonical parameter $\theta$
and the true mean $\mu(\theta)$. This situation changes under model uncertainty.
Krüger–Ziegel [227] study the question of having multiple strictly consistent loss functions
in the situation where there is no natural candidate choice. Different choices may
give different rankings to different (finite sample) predictors. Assume we have
two predictors $\hat\mu_1$ and $\hat\mu_2$ for a random variable $Y$. Similarly to the definition of
the expected deviance GL, we understand these predictors $\hat\mu_1$ and $\hat\mu_2$ as random
variables, and we assume that all considered random variables have a finite first
moment. Importantly, we do not assume independence between $\hat\mu_1$, $\hat\mu_2$ and $Y$,
and in regression models we typically receive dependence between predictors $\hat\mu$
and random variables $Y$ through the features (covariates) $\boldsymbol{x}$, see also Outlook 4.13.
Following Krüger–Ziegel [227] and Ehm et al. [119] we define forecast dominance
as follows.

Definition 4.20 (Forecast Dominance) Predictor $\hat\mu_1$ dominates predictor $\hat\mu_2$ if

$$\mathbb{E}\left[D_\psi(Y, \hat\mu_1)\right] \le \mathbb{E}\left[D_\psi(Y, \hat\mu_2)\right],$$

for all Bregman divergences $D_\psi$ with (convex) $\psi$ supported on $\mathfrak{C}$, the latter being
the convex closure of the supports of $Y$, $\hat\mu_1$ and $\hat\mu_2$.


If we work with a fixed member of the EDF, e.g., the gamma distribution, then 
we typically study the corresponding expected deviance GL for forecast evaluation 
in one single model, see Theorem 4.7 and (4.29). This evaluation may involve 
model risk in the decision making process, and forecast dominance provides a robust 
selection criterion. 

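Krüger–Ziegel [227] build on Theorem 1b and Corollary 1b of Ehm et al. [119] to
prove the following theorem (which avoids having to consider all convex functions
$\psi$).


Theorem 4.21 (Theorem 2.1 of Krüger–Ziegel [227]) Predictor $\hat\mu_1$ dominates
predictor $\hat\mu_2$ if and only if for all $\tau \in \mathfrak{C}$

$$\mathbb{E}\left[(Y - \tau)\, \mathbb{1}_{\{\hat\mu_1 > \tau\}}\right] \ge \mathbb{E}\left[(Y - \tau)\, \mathbb{1}_{\{\hat\mu_2 > \tau\}}\right]. \tag{4.30}$$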
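The threshold criterion of Theorem 4.21 is easy to evaluate when the predictors are functions of a feature $\boldsymbol{x}$ with finitely many values, because the tower property reduces each expectation to a finite sum over conditional means. The following minimal sketch uses a made-up portfolio with two equally likely feature values; predictor 1 is the true conditional mean and predictor 2 is the constant overall mean. Checking the inequality on a finite grid of thresholds is only a numerical plausibility check, not a proof of dominance.

```python
# Finite portfolio: (probability of x, E[Y | x], predictor 1 at x, predictor 2 at x).
# All numbers are illustrative.
portfolio = [
    (0.5, 1.0, 1.0, 2.0),
    (0.5, 3.0, 3.0, 2.0),
]

def threshold_mean(tau, which):
    """E[(Y - tau) * 1{mu_hat > tau}] for predictor number `which` (0 or 1)."""
    return sum(prob * (cond_mean - tau)
               for prob, cond_mean, *preds in portfolio if preds[which] > tau)

def dominates_on_grid(first, second, taus, tol=1e-12):
    """Check the threshold inequality for all thresholds in the grid `taus`."""
    return all(threshold_mean(t, first) >= threshold_mean(t, second) - tol for t in taus)

taus = [i / 20 for i in range(81)]            # thresholds 0, 0.05, ..., 4
print(dominates_on_grid(0, 1, taus))          # conditional mean vs. constant predictor
print(dominates_on_grid(1, 0, taus))          # reverse comparison
```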


Denuit et al. [97] argue that in insurance one typically works with Tweedie’s
distributions having power variances $V(\mu) = \mu^p$ with power variance parameters
$p \ge 1$. This motivates the following weaker form of forecast dominance.


Definition 4.22 (Tweedie’s Forecast Dominance) Predictor $\hat\mu_1$ Tweedie-dominates
predictor $\hat\mu_2$ if

$$\mathbb{E}\left[\mathfrak{d}_p(Y, \hat\mu_1)\right] \le \mathbb{E}\left[\mathfrak{d}_p(Y, \hat\mu_2)\right],$$

for all Tweedie’s unit deviances $\mathfrak{d}_p$ with power variance parameters $p \ge 1$, we
refer to (4.18) for $p \in (1, \infty) \setminus \{2\}$ and Table 4.1 for the Poisson and gamma cases
$p \in \{1, 2\}$.


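Recall that Tweedie’s unit deviances $\mathfrak{d}_p$ are a subclass of Bregman divergences,
see (2.29). Define the following function for power variance parameters $p \ge 1$

$$\Upsilon_p(\mu) = \begin{cases} \log \mu & \text{for } p = 2, \\[4pt] \dfrac{\mu^{2-p}}{2-p} & \text{otherwise.} \end{cases}$$

Denuit et al. [97] prove the following proposition.


Proposition 4.23 (Proposition 4.1 of Denuit et al. [97]) Predictor $\hat\mu_1$ Tweedie-dominates
predictor $\hat\mu_2$ if

$$\mathbb{E}\left[\Upsilon_p(\hat\mu_1)\right] \le \mathbb{E}\left[\Upsilon_p(\hat\mu_2)\right] \qquad \text{for all } p \ge 1,$$

and

$$\mathbb{E}\left[Y\, \mathbb{1}_{\{\hat\mu_1 > \tau\}}\right] \ge \mathbb{E}\left[Y\, \mathbb{1}_{\{\hat\mu_2 > \tau\}}\right] \qquad \text{for all } \tau \in \mathfrak{C}.$$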


Theorem 4.21 gives necessary and sufficient conditions to have forecast dom- 
inance, Proposition 4.23 gives sufficient conditions to have the weaker Tweedie’s 
forecast dominance. In Theorem 7.15, below, we give another characterization of 
forecast dominance in terms of convex orders, under the additional assumption that 
the predictors are so-called auto-calibrated. 


4.2 Cross-Validation 


This section focuses on estimating the expected deviance GL (4.13) in cases where
the canonical parameter $\theta$ is not known. Of course, the same concepts apply to the
MSEP. In the remainder of this section we scale the unit deviances with $v/\varphi$, to
bring them in line with the deviance loss (4.9).


4.2.1 In-Sample and Out-of-Sample Losses 


The general aim in predictive modeling is to predict an unobserved random variable
$Y$ as well as possible based on past information $\boldsymbol{Y}_n$. Within the EDF, the predictive
performance is then evaluated under an empirical version of the expected deviance
GL

$$\mathbb{E}_\theta\left[\frac{v}{\varphi}\, \mathfrak{d}(Y, A(\boldsymbol{Y}_n))\right] = 2\, \mathbb{E}_\theta\left[\frac{v}{\varphi}\left( Y h(Y) - \kappa(h(Y)) - Y h(A(\boldsymbol{Y}_n)) + \kappa\left(h(A(\boldsymbol{Y}_n))\right) \right)\right]. \tag{4.31}$$

Here, we no longer assume that $Y$ and $A(\boldsymbol{Y}_n)$ are independent, and in the dependent
case Theorem 4.7 does not apply. The reason for dropping the independence
assumption is that below we consider regression models of a similar type as in
Outlook 4.13. The expected deviance GL (4.31) as such is not directly useful
because it cannot be calculated if the true canonical parameter $\theta$ is not known.
Therefore, we are going to explain how it can be estimated empirically.

We start from the expected deviance GL in the EDF applied to the MLE decision
rule $\hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$. It can be rewritten as

$$\mathbb{E}_\theta\left[\frac{v}{\varphi}\, \mathfrak{d}\left(Y, \hat\mu^{\rm MLE}\right)\right] = \int \mathbb{E}_\theta\left[\left.\frac{v}{\varphi}\, \mathfrak{d}\left(Y, \hat\mu^{\rm MLE}\right) \right| \boldsymbol{Y}_n = \boldsymbol{y}_n\right] dP(\boldsymbol{y}_n; \theta), \tag{4.32}$$

where we use the tower property for conditional expectations. In view of (4.32),
there are two things to be done:


(1) For given observations $\boldsymbol{Y}_n = \boldsymbol{y}_n$, we need to estimate the deviance GL, see
    also (4.15),

$$\mathbb{E}_\theta\left[\left.\frac{v}{\varphi}\, \mathfrak{d}\left(Y, \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)\right) \right| \boldsymbol{Y}_n = \boldsymbol{y}_n\right] = \mathbb{E}_\theta\left[\left.\frac{v}{\varphi}\, \mathfrak{d}\left(Y, \hat\mu^{\rm MLE}(\boldsymbol{y}_n)\right) \right| \boldsymbol{Y}_n = \boldsymbol{y}_n\right]. \tag{4.33}$$

    This is the part that we are going to solve empirically in this section.
    Typically, we assume that $Y$ and $\boldsymbol{Y}_n$ are independent, nevertheless, $Y$ and
    its MLE predictor may still be dependent because we may have a predictor
    $\hat\mu^{\rm MLE} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n, \boldsymbol{x})$. That is, this predictor often depends on covariate
    information $\boldsymbol{x}$ that describes $Y$, an example is provided in (4.22) of Outlook 4.13
    and this is different from (4.15). In that case, the decision rule $A: \mathcal{Y} \times \mathcal{X} \to \mathbb{A}$
    is extended by an additional covariate component $\boldsymbol{x} \in \mathcal{X}$, we refer to Sect. 5.1.1,
    where $\mathcal{X}$ is introduced and discussed.

(2) We have to find a way to generate more observations $\boldsymbol{y}_n$ from $P(\boldsymbol{y}_n; \theta)$ in
    order to evaluate the outer integral in (4.32) empirically. One way to do so is
    the bootstrap method that is going to be discussed in Sect. 4.3, below.


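We address the first problem of estimating the deviance GL given in (4.33).
We do this under the assumption that $\boldsymbol{Y}_n$ and $Y$ are independent. In order to
estimate (4.33) we need observations for $Y$. However, typically, there are no
observations available for this random variable because it is only going to be
observed in the future. For this reason, one uses past observations for both, model
fitting and the GL analysis. In order to perform this analysis in a proper way, the
general paradigm is to partition the entire data into two disjoint data sets, a
so-called learning data set $\mathcal{L} = \{Y_1, \ldots, Y_n\}$ and a test data set $\mathcal{T} = \{Y_1^\dagger, \ldots, Y_T^\dagger\}$.
If we assume that all observations in $\mathcal{L} \cup \mathcal{T}$ are independent, then we receive a
suitable observation $\boldsymbol{Y}_n$ from the learning data set $\mathcal{L}$ that can be used for model
fitting. The test sample $\mathcal{T}$ can then play the role of the unobserved random variable
$Y$ (by assumption being independent of $\boldsymbol{Y}_n$). Note that $\mathcal{L}$ is only used for model
fitting and $\mathcal{T}$ is only used for the deviance GL evaluation, see Fig. 4.1.

This setup motivates to estimate the mean parameter $\mu$ with MLE $\hat\mu^{\rm MLE}_{\mathcal{L}} =
\hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$ from the learning data $\mathcal{L}$ and $\boldsymbol{Y}_n$, respectively, by minimizing the
deviance loss function $\mu \mapsto \mathfrak{D}(\boldsymbol{Y}_n, \mu)$ on the learning data $\mathcal{L}$, according to
Corollary 4.5. Then we use this predictor $\hat\mu^{\rm MLE}_{\mathcal{L}}$ to empirically evaluate the conditional
expectation in (4.33) on $\mathcal{T}$. The perception used is that we (in-sample) learn a
model on $\mathcal{L}$ and we out-of-sample test this model on $\mathcal{T}$ to see how it generalizes
to unobserved variables $Y_t^\dagger$, $1 \le t \le T$, that are of a similar nature as $Y$.

Fig. 4.1 Partition of entire data into learning data set $\mathcal{L}$ and test data set $\mathcal{T}$


Definition 4.24 (In-Sample and Out-of-Sample Losses) The in-sample
deviance loss on the learning data $\mathcal{L} = \{Y_1, \ldots, Y_n\}$ is given by

$$\mathfrak{D}\left(\mathcal{L}, \hat\mu^{\rm MLE}_{\mathcal{L}}\right) = \frac{2}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\left( Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h\left(\hat\mu^{\rm MLE}_{\mathcal{L}}\right) + \kappa\left(h\left(\hat\mu^{\rm MLE}_{\mathcal{L}}\right)\right) \right),$$

with MLE $\hat\mu^{\rm MLE}_{\mathcal{L}} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$ on $\mathcal{L}$.
The out-of-sample deviance loss on the test data $\mathcal{T} = \{Y_1^\dagger, \ldots, Y_T^\dagger\}$ of
predictor $\hat\mu^{\rm MLE}_{\mathcal{L}}$ is

$$\mathfrak{D}\left(\mathcal{T}, \hat\mu^{\rm MLE}_{\mathcal{L}}\right) = \frac{2}{T}\sum_{t=1}^T \frac{v_t^\dagger}{\varphi}\left( Y_t^\dagger h(Y_t^\dagger) - \kappa(h(Y_t^\dagger)) - Y_t^\dagger h\left(\hat\mu^{\rm MLE}_{\mathcal{L}}\right) + \kappa\left(h\left(\hat\mu^{\rm MLE}_{\mathcal{L}}\right)\right) \right),$$

where the sum runs over the test sample $\mathcal{T}$ having exposures $v_1^\dagger, \ldots, v_T^\dagger > 0$.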


For MLE we minimize the objective function (4.9), therefore, the in-sample
deviance loss $\mathfrak{D}(\mathcal{L}, \hat\mu^{\rm MLE}_{\mathcal{L}}) = \mathfrak{D}(\boldsymbol{Y}_n, \hat\mu^{\rm MLE}(\boldsymbol{Y}_n))$ exactly corresponds to the
minimal deviance loss (4.9) achieved on the learning data $\mathcal{L}$, i.e., when using
MLE $\hat\mu^{\rm MLE}_{\mathcal{L}} = \hat\mu^{\rm MLE}(\boldsymbol{Y}_n)$. We call this in-sample because the same data $\mathcal{L}$ is
used for parameter estimation and deviance loss calculation. Typically, this loss is
biased because it uses the optimal (in-sample) parameter estimate, we also refer to
Sect. 4.2.3, below.

The out-of-sample loss $\mathfrak{D}(\mathcal{T}, \hat\mu^{\rm MLE}_{\mathcal{L}})$ then empirically estimates the inner
expectation in (4.32). This is a proper out-of-sample analysis because the test data $\mathcal{T}$
is disjoint from the learning data $\mathcal{L}$ on which the decision rule $\hat\mu^{\rm MLE}_{\mathcal{L}}$ has been
trained. Note that this out-of-sample figure reflects (4.33) in the following sense.
We have a portfolio of risks $(Y_t^\dagger, v_t^\dagger)$, $1 \le t \le T$, and (4.33) does not only reflect
the calculation of the deviance GL of a given risk, but also the random selection of
a risk from the portfolio. In this sense, (4.33) is an average over a given portfolio
whose description is also included in the probability $P_\theta$.
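The in-sample and out-of-sample deviance losses of Definition 4.24 can be sketched for a homogeneous Poisson frequency model with $\varphi = 1$. The example below is a minimal illustration on simulated data; the true frequency, the exposure range, the 80/20 split and the simple Poisson sampler are our assumptions and are not taken from the text. The deviance is written in terms of claim counts $N_i = v_i Y_i$.

```python
import math
import random

def poisson_deviance_loss(counts, exposures, mu):
    """(1/n) * sum_i 2 * (mu*v_i - N_i - N_i*log(mu*v_i/N_i)); terms with N_i = 0 give 2*mu*v_i."""
    total = 0.0
    for n_i, v_i in zip(counts, exposures):
        total += 2.0 * (mu * v_i - n_i)
        if n_i > 0:
            total -= 2.0 * n_i * math.log(mu * v_i / n_i)
    return total / len(counts)

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(1)
true_mu = 0.08
exposures = [rng.uniform(0.1, 1.0) for _ in range(5000)]
counts = [sample_poisson(rng, true_mu * v) for v in exposures]

# random partition into learning data (80%) and test data (20%)
idx = list(range(5000))
rng.shuffle(idx)
learn, test = idx[:4000], idx[4000:]

def subset(values, indices):
    return [values[i] for i in indices]

# MLE on the learning data: total claims divided by total exposure
mu_hat = sum(subset(counts, learn)) / sum(subset(exposures, learn))
in_sample = poisson_deviance_loss(subset(counts, learn), subset(exposures, learn), mu_hat)
out_of_sample = poisson_deviance_loss(subset(counts, test), subset(exposures, test), mu_hat)
print(mu_hat, in_sample, out_of_sample)
```

The in-sample loss is minimal by construction on the learning data, whereas the out-of-sample loss evaluates the same predictor on data not used for fitting.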


Summary 4.25 Definition 4.24 gives the general principle in predictive
modeling according to which model learning and the generalization analysis
are done. Namely, based on two disjoint and independent data sets $\mathcal{L}$ and $\mathcal{T}$,
we perform model calibration on $\mathcal{L}$, and we analyze (conditional) GLs (using
out-of-sample losses) on $\mathcal{T}$, respectively. For this concept to be useful, the
learning data $\mathcal{L}$ and the test data $\mathcal{T}$ have to be sufficiently similar, i.e., ideally
coming from the same model.

This approach does not estimate the outer expectation in the expected
deviance GL (4.32), i.e., it is only an estimate for the deviance GL, given
$\boldsymbol{Y}_n$, see (4.33).


4.2.2 Cross-Validation Techniques 


In many applications one is not in the comfortable situation of having two
sufficiently large data sets $\mathcal{L}$ and $\mathcal{T}$ available to support model learning and an
out-of-sample generalization analysis. That is, we are usually equipped with only
one data set of average size, let us call it $\mathcal{D}$. In order to calculate the objects in
Definition 4.24 we could partition this data set (at random) into two data sets and
then calculate in-sample and out-of-sample deviance losses on this partition. The
disadvantage of this approach is that it is an inefficient use of information if only
little data is available. In that case we require (almost) all data for learning. However,
we still need a sufficiently large share of data for testing, to receive reliable deviance
GL estimates for (4.33). The classical approach in this situation is to use
cross-validation for estimating out-of-sample losses. The concept works as follows:


1. Perform model learning and in-sample loss calculation $\mathfrak{D}(\mathcal{L}, \hat\mu^{\rm MLE}_{\mathcal{L}})$ on all
   available data $\mathcal{L} = \mathcal{D}$, i.e., this part is not affected by selecting test data $\mathcal{T}$
   and it is not touched by cross-validation.

2. For out-of-sample deviance loss calculation use the data $\mathcal{D}$ iteratively in an
   efficient way such that part of the data is used for model learning and the
   other part for the out-of-sample generalization analysis. This second step
   is (only) done for estimating the deviance GL of the model learned on all
   data. I.e., for prediction we work with MLE $\hat\mu^{\rm MLE}_{\mathcal{L}=\mathcal{D}}$, but the out-of-sample
   deviance loss is estimated using this data in a different way.


The three most commonly used methods are leave-one-out, K-fold and stratified
K-fold cross-validation. We briefly describe these three cross-validation methods.


Leave-One-Out Cross-Validation


Denote all available data by $\mathcal{D} = \{Y_1, \ldots, Y_n\}$, and assume independence between
the components. For leave-one-out (loo) cross-validation we select $1 \le i \le n$ and
define the partition $\mathcal{L}^{(-i)} = \mathcal{D} \setminus \{Y_i\}$ for the learning data and $\mathcal{T}_i = \{Y_i\}$ for the test
data. Based on the learning data $\mathcal{L}^{(-i)}$ we calculate the MLE

$$\hat\mu^{(-i)} \overset{\rm def.}{=} \hat\mu^{\rm MLE}_{\mathcal{L}^{(-i)}},$$

which is based on all data except observation $Y_i$. This observation is now used to
do an out-of-sample analysis, and averaging this over all $1 \le i \le n$ we receive the
leave-one-out cross-validation loss

$$\begin{aligned}
\hat{\mathfrak{D}}^{\rm loo} &= \frac{1}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}\left(Y_i, \hat\mu^{(-i)}\right) = \frac{1}{n}\sum_{i=1}^n \mathfrak{D}\left(\mathcal{T}_i, \hat\mu^{(-i)}\right) \\
&= \frac{2}{n}\sum_{i=1}^n \frac{v_i}{\varphi}\left( Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h\left(\hat\mu^{(-i)}\right) + \kappa\left(h\left(\hat\mu^{(-i)}\right)\right) \right),
\end{aligned} \tag{4.34}$$

where $\mathfrak{D}(\mathcal{T}_i, \hat\mu^{(-i)})$ is the (out-of-sample) cross-validation loss on $\mathcal{T}_i = \{Y_i\}$ using
the predictor $\hat\mu^{(-i)}$. This leave-one-out cross-validation loss $\hat{\mathfrak{D}}^{\rm loo}$ is now used as
estimate for the out-of-sample deviance loss $\mathfrak{D}(\mathcal{T}, \hat\mu^{\rm MLE}_{\mathcal{L}=\mathcal{D}})$. Leave-one-out
cross-validation uses all data $\mathcal{D}$ for learning and testing, namely, the data $\mathcal{D}$ is partitioned
into a learning set $\mathcal{L}^{(-i)}$ for (partial) learning and a test set $\mathcal{T}_i = \{Y_i\}$ for an
out-of-sample generalization analysis. This is done for all instances $1 \le i \le n$, and the
out-of-sample loss is estimated by the resulting average cross-validation loss. This
averaging allows us to not only understand (4.34) as a conditional out-of-sample loss
in the spirit of Definition 4.24. The outer empirical average in (4.34) also makes it
suitable for an expected deviance GL estimate according to (4.32).
The variance of this empirical deviance GL is given by (subject to existence)

$$\mathrm{Var}_\theta\left(\hat{\mathfrak{D}}^{\rm loo}\right) = \frac{1}{n^2}\sum_{i,j=1}^n \mathrm{Cov}_\theta\left( \frac{v_i}{\varphi}\, \mathfrak{d}\left(Y_i, \hat\mu^{(-i)}\right),\ \frac{v_j}{\varphi}\, \mathfrak{d}\left(Y_j, \hat\mu^{(-j)}\right) \right).$$

Fig. 4.2 Partitions of K-fold cross-validation for K = 5

These covariances use exactly the same observations on $\mathcal{D} \setminus \{Y_i, Y_j\}$, therefore,
there are strong correlations between the estimators $\hat\mu^{(-i)}$ and $\hat\mu^{(-j)}$. In addition,
the leave-one-out cross-validation is often computationally not feasible because it
requires fitting the model $n$ times, which in the situation of complex models and of
large insurance portfolios can be too demanding. We come back to this in Sect. 5.6
where we provide the generalized cross-validation (GCV) loss approximation within
generalized linear models (GLMs).
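For the homogeneous Poisson model with $\varphi = 1$ the $n$ refits of leave-one-out cross-validation are cheap, because the MLE on $\mathcal{D} \setminus \{Y_i\}$ is obtained by removing one claim count and one exposure from the totals. The following minimal sketch uses made-up toy data and our own function names; it does not carry over to models without such a closed-form update.

```python
import math

def poisson_unit_deviance(y, mu):
    ylog = 0.0 if y == 0 else y * math.log(y / mu)
    return 2.0 * (mu - y + ylog)

def in_sample_loss(counts, exposures):
    """In-sample deviance loss using the MLE on all data (total claims / total exposure)."""
    mu_hat = sum(counts) / sum(exposures)
    return sum(v * poisson_unit_deviance(n / v, mu_hat)
               for n, v in zip(counts, exposures)) / len(counts)

def loo_loss(counts, exposures):
    """Leave-one-out cross-validation loss for the homogeneous Poisson model, phi = 1."""
    n_tot, v_tot = sum(counts), sum(exposures)
    loss = 0.0
    for n_i, v_i in zip(counts, exposures):
        mu_minus_i = (n_tot - n_i) / (v_tot - v_i)      # MLE on all data except observation i
        loss += v_i * poisson_unit_deviance(n_i / v_i, mu_minus_i)
    return loss / len(counts)

counts = [0, 1, 0, 0, 2, 0, 1, 0]                        # toy claim counts
exposures = [1.0, 0.5, 1.0, 0.8, 1.0, 0.3, 1.0, 0.9]     # toy exposures
print(in_sample_loss(counts, exposures), loo_loss(counts, exposures))
```

In this model the leave-one-out loss is never smaller than the in-sample loss, because the full-data MLE always lies between $Y_i$ and the leave-one-out estimate.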


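K-Fold Cross-Validation


Choose a fixed integer $K \ge 2$ and partition the entire data $\mathcal{D}$ at random into $K$
disjoint subsets (called folds) $\mathcal{L}_1, \ldots, \mathcal{L}_K$ of approximately the same size. The
learning data for fixed $1 \le k \le K$ is then defined by $\mathcal{L}^{[-k]} = \mathcal{D} \setminus \mathcal{L}_k$ and the
test data by $\mathcal{T}_k = \mathcal{L}_k$, see Fig. 4.2. Based on learning data $\mathcal{L}^{[-k]}$ we calculate the
MLE

$$\hat\mu^{[-k]} \overset{\rm def.}{=} \hat\mu^{\rm MLE}_{\mathcal{L}^{[-k]}},$$

which is based on all data except $\mathcal{T}_k$.

These observations are now used to do an (out-of-sample) cross-validation
analysis, and averaging this over all $1 \le k \le K$ we receive the K-fold
cross-validation (CV) loss

$$\hat{\mathfrak{D}}^{\rm CV} = \frac{1}{n}\sum_{k=1}^K \sum_{Y_i \in \mathcal{T}_k} \frac{v_i}{\varphi}\, \mathfrak{d}\left(Y_i, \hat\mu^{[-k]}\right) = \frac{1}{n}\sum_{k=1}^K |\mathcal{T}_k|\, \mathfrak{D}\left(\mathcal{T}_k, \hat\mu^{[-k]}\right) \approx \frac{1}{K}\sum_{k=1}^K \mathfrak{D}\left(\mathcal{T}_k, \hat\mu^{[-k]}\right). \tag{4.35}$$

The last step is an approximation because not all $\mathcal{T}_k$ may have exactly the same
sample size if $n$ is not a multiple of $K$. We can understand (4.35) not only as a
conditional out-of-sample loss estimate in the spirit of Definition 4.24. The outer
empirical average in (4.35) also makes it suitable for an expected deviance GL
estimate according to (4.32). The variance of this empirical deviance GL is given by
(subject to existence)

$$\mathrm{Var}_\theta\left(\hat{\mathfrak{D}}^{\rm CV}\right) \approx \frac{1}{n^2}\sum_{k,l=1}^K\ \sum_{Y_i \in \mathcal{T}_k}\ \sum_{Y_j \in \mathcal{T}_l} \mathrm{Cov}_\theta\left( \frac{v_i}{\varphi}\, \mathfrak{d}\left(Y_i, \hat\mu^{[-k]}\right),\ \frac{v_j}{\varphi}\, \mathfrak{d}\left(Y_j, \hat\mu^{[-l]}\right) \right).$$

Typically, in applications, one uses K-fold cross-validation with $K = 10$.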
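The K-fold procedure described above can be sketched as follows for the homogeneous Poisson model with $\varphi = 1$: the data is randomly partitioned into $K$ folds, the MLE is fitted on the data outside each fold, and the deviance of the held-out observations is averaged over all $n$ observations. The toy data, the function names and the use of Python's `random.Random` for the random partition are our assumptions.

```python
import math
import random

def poisson_deviance_loss(counts, exposures, mu):
    """(1/n) * sum_i 2 * (mu*v_i - N_i - N_i*log(mu*v_i/N_i)); terms with N_i = 0 give 2*mu*v_i."""
    total = 0.0
    for n_i, v_i in zip(counts, exposures):
        total += 2.0 * (mu * v_i - n_i)
        if n_i > 0:
            total -= 2.0 * n_i * math.log(mu * v_i / n_i)
    return total / len(counts)

def k_fold_cv_loss(counts, exposures, K, rng):
    """K-fold cross-validation loss of the homogeneous Poisson model."""
    n = len(counts)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]        # K disjoint folds of (almost) equal size
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        learn = [i for i in idx if i not in held_out]
        # MLE on the learning data of this fold: total claims / total exposure
        mu_k = sum(counts[i] for i in learn) / sum(exposures[i] for i in learn)
        fold_loss = poisson_deviance_loss([counts[i] for i in fold],
                                          [exposures[i] for i in fold], mu_k)
        total += len(fold) * fold_loss           # weight each fold by its size
    return total / n

counts = [0, 1, 0, 0, 2, 0, 1, 0]                        # toy claim counts
exposures = [1.0, 0.5, 1.0, 0.8, 1.0, 0.3, 1.0, 0.9]     # toy exposures
print(k_fold_cv_loss(counts, exposures, K=4, rng=random.Random(0)))
```

Choosing $K = n$ turns every fold into a single observation, so that the function reproduces leave-one-out cross-validation.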


Stratified K-Fold Cross-Validation


A disadvantage of the above K-fold cross-validation is that it may happen that there
are two outliers in the data, and there is a positive probability that these two outliers
belong to the same subset $\mathcal{L}_k$. This may substantially distort K-fold cross-validation
because in that case the subsets $\mathcal{L}_k$, $1 \le k \le K$, are of different quality. Stratified
K-fold cross-validation aims at distributing outliers more equally across the partition.
Order the observations $Y_i$, $1 \le i \le n$, as follows

$$Y_{(1)} \ge Y_{(2)} \ge \ldots \ge Y_{(n)}.$$

For stratified K-fold cross-validation, we randomly distribute (partition) the $K$
biggest claims $Y_{(1)}, \ldots, Y_{(K)}$ to the subsets $\mathcal{L}_k$, $1 \le k \le K$, then we randomly
partition the next $K$ biggest claims $Y_{(K+1)}, \ldots, Y_{(2K)}$ to the subsets $\mathcal{L}_k$, $1 \le k \le K$,
and so forth. This implies, e.g., that the two biggest claims cannot fall into the same
set $\mathcal{L}_k$. This stratified partition $\mathcal{L}_k$, $1 \le k \le K$, is then used for K-fold
cross-validation.

Summary 4.26 (Cross-Validation)


• A model is calibrated on the learning data set $\mathcal{L}$ by minimizing the
  in-sample deviance loss $\mathfrak{D}(\mathcal{L}, \mu)$ in $\mu$. This provides MLE $\hat\mu^{\rm MLE}_{\mathcal{L}}$.

• The quality of this model is assessed on test data $\mathcal{T}$ being disjoint of $\mathcal{L}$
  considering the corresponding out-of-sample deviance loss $\mathfrak{D}(\mathcal{T}, \hat\mu^{\rm MLE}_{\mathcal{L}})$.

• If there is no test data set $\mathcal{T}$ available we perform (stratified) K-fold
  cross-validation. This provides the (stratified) K-fold cross-validation loss
  $\hat{\mathfrak{D}}^{\rm CV}$ which is an estimate for the out-of-sample deviance loss and for the
  expected deviance GL (4.32).


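Example 4.27 (Out-of-Sample Deviance Loss Estimation) We consider a claim
counts example using the Poisson EDF model. The claim counts $N_i$ and exposures
$v_i > 0$ used come from the French motor insurance data given in Listing 13.2
of Chap. 13.1. We model the claim frequencies $Y_i = N_i / v_i$ with the Poisson EDF
model having cumulant function $\kappa(\theta) = \exp\{\theta\}$ and dispersion parameter $\varphi = 1$ for
all $1 \le i \le n$. The expected frequency is given by $\mu = E_\theta[Y_i] = \kappa'(\theta)$. Moreover,
we assume that all claim counts $N_i$, $1 \le i \le n$, are independent. This provides us
with the Poisson deviance loss function for observations $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$, see
Example 4.12,

$$
\mathfrak{D}(\boldsymbol{Y}_n, \mu) = \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu)
= \frac{1}{n} \sum_{i=1}^n 2 v_i \left( \mu - Y_i - Y_i \log\left(\frac{\mu}{Y_i}\right) \right)
= \frac{1}{n} \sum_{i=1}^n 2 \left( \mu v_i - N_i - N_i \log\left(\frac{\mu v_i}{N_i}\right) \right) \ge 0,
$$

where, for $Y_i = 0$, we set $\mathfrak{d}(Y_i = 0, \mu) = 2\mu$. Minimizing the Poisson deviance
loss function $\mathfrak{D}(\boldsymbol{Y}_n, \mu)$ in $\mu$ gives us the MLE for $\mu$ and $\theta = h(\mu)$, respectively. It
is given by, see (3.24),

$$
\widehat{\mu}^{\rm MLE}_{\mathcal{L}} = \widehat{\mu}^{\rm MLE} = \frac{\sum_{i=1}^n N_i}{\sum_{i=1}^n v_i} = 7.36\%,
$$

for learning data set $\mathcal{L} = \{Y_1, \ldots, Y_n\}$. This provides us with an in-sample Poisson
deviance loss of $\mathfrak{D}(\boldsymbol{Y}_n, \widehat{\mu}^{\rm MLE}_{\mathcal{L}}) = \mathfrak{D}(\mathcal{L}, \widehat{\mu}^{\rm MLE}_{\mathcal{L}}) = 25.213 \cdot 10^{-2}$.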

Since we do not have test data $\mathcal{T}$, we explore tenfold cross-validation. We
therefore partition the entire data at random into K = 10 disjoint sets $\mathcal{L}_1, \ldots, \mathcal{L}_{10}$,
and compute the tenfold cross-validation loss as described in (4.35). This gives us
$\widehat{\mathfrak{D}}^{\rm CV} = 25.213 \cdot 10^{-2}$; thus, we obtain the same value as for the in-sample loss,
which says that we do not have in-sample over-fitting here. This is not surprising
in the homogeneous model $\mu = E_\theta[Y_i]$. We can also quantify the uncertainty in this
estimate by the corresponding empirical standard deviation for $\mathcal{T}_k = \mathcal{L}_k$

$$
\sqrt{\frac{1}{K-1} \sum_{k=1}^K \left( \mathfrak{D}\big(\mathcal{T}_k, \widehat{\mu}^{(-\mathcal{T}_k)}\big) - \widehat{\mathfrak{D}}^{\rm CV} \right)^2} = 0.234 \cdot 10^{-2}. \tag{4.36}
$$

This says that there is quite some fluctuation in the data because the uncertainty in the
estimate $\widehat{\mathfrak{D}}^{\rm CV} = 25.213 \cdot 10^{-2}$ is roughly 1%. This finishes this example, and we
will come back to it in Sect. 5.2.4, below.
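
The computations of this example can be mimicked by the following Python sketch. It does not use the French motor insurance data; the exposures and claim counts are simulated, and all function names, the seed, the sample size and the frequency 0.08 are assumptions made purely for illustration. The empirical standard deviation uses the divisor K - 1, which is also an assumption of this sketch.

```python
import math
import random


def poisson_deviance(N, v, mu):
    """Average Poisson deviance loss (1/n) * sum_i 2*(mu*v_i - N_i - N_i*log(mu*v_i/N_i))."""
    total = 0.0
    for n_i, v_i in zip(N, v):
        total += 2.0 * (mu * v_i - n_i)
        if n_i > 0:  # for N_i = 0 the log-term vanishes and the contribution is 2*mu*v_i
            total -= 2.0 * n_i * math.log(mu * v_i / n_i)
    return total / len(N)


def mle(N, v):
    """MLE of the homogeneous frequency: total claims over total exposure."""
    return sum(N) / sum(v)


def kfold_cv(N, v, K=10, seed=0):
    """Return the K-fold CV loss and the empirical standard deviation over the folds."""
    rng = random.Random(seed)
    idx = list(range(len(N)))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]          # random partition L_1, ..., L_K
    losses = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        mu_hat = mle([N[i] for i in train], [v[i] for i in train])
        losses.append(poisson_deviance([N[i] for i in test], [v[i] for i in test], mu_hat))
    cv = sum(losses) / K
    sd = math.sqrt(sum((loss - cv) ** 2 for loss in losses) / (K - 1))
    return cv, sd


def simulate_poisson(lam, rng):
    """Knuth's multiplication method (adequate for small lam)."""
    limit, count, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return count
        count += 1


rng = random.Random(42)
v = [rng.uniform(0.1, 1.0) for _ in range(2000)]       # synthetic exposures
N = [simulate_poisson(0.08 * v_i, rng) for v_i in v]   # synthetic claim counts

mu_hat = mle(N, v)
in_sample = poisson_deviance(N, v, mu_hat)
cv_loss, cv_sd = kfold_cv(N, v, K=10)
print(mu_hat, in_sample, cv_loss, cv_sd)
```

In such a one-parameter model the in-sample loss and the cross-validation loss are expected to be close, in line with the observation made in the example.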


4.2.3 Akaike’s Information Criterion 


The out-of-sample analysis in terms of GLs and cross-validation evaluates the
predictive performance on unseen data. Another way of model selection is to study
in-sample losses instead, but penalize model complexity. Akaike’s information
criterion (AIC), see Akaike [5], is the most popular tool that follows such a model
selection methodology. AIC is based on a set of assumptions which should be
fulfilled for it to apply; these are discussed in this section, where we follow
the lecture notes of Künsch [229].

Assume we have independent random variables $Y_i$ from some (unknown) density
$f$. Assume we have two candidate models with densities $h_\theta$ and $g_\vartheta$ from which we
would like to select the preferred one for the given data $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)$. The two
unknown parameters in these densities $h_\theta$ and $g_\vartheta$ are called $\theta$ and $\vartheta$, respectively.
We neither assume that one of the two models $h_\theta$ and $g_\vartheta$ contains the true model $f$,
nor that the two models are nested. That is, $f$, $h_\theta$ and $g_\vartheta$ are quite general densities
w.r.t. a given $\sigma$-finite measure $\nu$.

Assume that both models under consideration have a unique MLE $\widehat{\theta}^{\rm MLE} =
\widehat{\theta}^{\rm MLE}(\boldsymbol{Y}_n)$ and $\widehat{\vartheta}^{\rm MLE} = \widehat{\vartheta}^{\rm MLE}(\boldsymbol{Y}_n)$, both based on the same observations $\boldsymbol{Y}_n$.
AIC [5] says that model $h_{\widehat{\theta}^{\rm MLE}}$ should be preferred over model $g_{\widehat{\vartheta}^{\rm MLE}}$ if

$$
-2 \sum_{i=1}^n \log\big(h_{\widehat{\theta}^{\rm MLE}}(Y_i)\big) + 2 \dim(\theta) \ < \ -2 \sum_{i=1}^n \log\big(g_{\widehat{\vartheta}^{\rm MLE}}(Y_i)\big) + 2 \dim(\vartheta), \tag{4.37}
$$

where $\dim(\cdot)$ denotes the dimension of the corresponding parameter. Thus, we
compute the log-likelihoods of the data $\boldsymbol{Y}_n$ in the corresponding MLEs $\widehat{\theta}^{\rm MLE}$ and
$\widehat{\vartheta}^{\rm MLE}$, and we penalize the resulting values with the number of parameters to correct
for model complexity. We give some remarks.
Remarks 4.28


• AIC is neither an in-sample loss nor an out-of-sample loss to measure generalization
  accuracy, but it considers penalized log-likelihoods. Under certain
  assumptions one can prove that asymptotically minimizing AICs is equivalent
  to minimizing leave-one-out cross-validation mean squared errors.

• The two penalized log-likelihoods have to be evaluated on the same data $\boldsymbol{Y}_n$
  and they need to consider the MLEs $\widehat{\theta}^{\rm MLE}$ and $\widehat{\vartheta}^{\rm MLE}$ because the justification
  of AIC is based on the asymptotic normality of MLEs; otherwise there is no
  mathematical justification why (4.37) should be a reasonable model selection
  tool.

• AIC does not require (but allows for) nested models $h_\theta$ and $g_\vartheta$, nor do they need to be
  Gaussian; it is only based on asymptotic normality. We give a heuristic argument
  below.

• Evaluation of (4.37) involves all terms of the log-likelihoods, also those that do
  not depend on the parameters $\theta$ and $\vartheta$.

• Both models should consider the data $\boldsymbol{Y}_n$ in the same units, i.e., AIC does not
  apply if $h_\theta$ is a density for $Y_i$ and $g_\vartheta$ is a density for $cY_i$. In that case, one has
  to perform a transformation of variables to ensure that both densities consider
  the data in the same units. We briefly highlight this by considering a Gaussian
  example. We choose i.i.d. observations $Y_i \sim \mathcal{N}(\theta, \sigma^2)$ for known variance $\sigma^2 > 0$.
  For $c > 0$, we have $cY_i \sim \mathcal{N}(\vartheta = c\theta, c^2\sigma^2)$. We obtain the MLE $\widehat{\theta}^{\rm MLE} =
  \sum_{i=1}^n Y_i / n$ and the log-likelihood in the MLE $\widehat{\theta}^{\rm MLE}$

  $$
  \sum_{i=1}^n \log\big(h_{\widehat{\theta}^{\rm MLE}}(Y_i)\big) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \big(Y_i - \widehat{\theta}^{\rm MLE}\big)^2.
  $$

  On the transformed scale we have the MLE $\widehat{\vartheta}^{\rm MLE} = \sum_{i=1}^n cY_i / n = c\,\widehat{\theta}^{\rm MLE}$ and the
  log-likelihood in the MLE $\widehat{\vartheta}^{\rm MLE}$

  $$
  \sum_{i=1}^n \log\big(g_{\widehat{\vartheta}^{\rm MLE}}(cY_i)\big) = -\frac{n}{2} \log(2\pi c^2\sigma^2) - \frac{1}{2c^2\sigma^2} \sum_{i=1}^n \big(cY_i - c\,\widehat{\theta}^{\rm MLE}\big)^2.
  $$

  Thus, we find that the two log-likelihoods differ by $-n\log(c)$, but we consider the
  same model only under different measurement units of the data. The same applies
  when we work, e.g., with a log-normal model or logged data in a Gaussian model.
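
The effect of the measurement units in the Gaussian example of the last remark can be checked numerically. The following Python sketch uses simulated data; the values of $\sigma^2$, $c$, $n$ and the seed are arbitrary assumptions.

```python
import math
import random


def gauss_loglik(y, mean, var):
    """Full Gaussian log-likelihood, including the terms that do not depend on the mean."""
    n = len(y)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((yi - mean) ** 2 for yi in y) / (2 * var)


rng = random.Random(1)
sigma2, c, n = 4.0, 100.0, 50                  # known variance, unit change, sample size
y = [rng.gauss(1.0, math.sqrt(sigma2)) for _ in range(n)]

theta_hat = sum(y) / n                         # MLE on the original scale
ll_h = gauss_loglik(y, theta_hat, sigma2)

cy = [c * yi for yi in y]                      # same data in different units
vartheta_hat = sum(cy) / n                     # MLE on the transformed scale
ll_g = gauss_loglik(cy, vartheta_hat, c * c * sigma2)

diff = ll_g - ll_h                             # compare with -n*log(c)
print(diff, -n * math.log(c))
```

Both printed numbers agree up to floating point error, although the two log-likelihoods describe the same model; this is why AIC values are only comparable if all densities refer to the data in the same units.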


We give a heuristic justification of AIC. In Example 3.10 we have seen that
the MLE is obtained by minimizing the KL divergence from $h_\theta$ to the empirical
distribution $\widehat{f}_n$ of $\boldsymbol{Y}_n$. This motivates us to use the KL divergence also for comparing
the MLE estimated models to the true model, i.e., we consider the difference
(supposing that the densities are defined on the same domain)

$$
\begin{aligned}
D_{\rm KL}\big(f \,\big\|\, h_{\widehat{\theta}^{\rm MLE}}\big) - D_{\rm KL}\big(f \,\big\|\, g_{\widehat{\vartheta}^{\rm MLE}}\big)
&= \int \log\left(\frac{f(y)}{h_{\widehat{\theta}^{\rm MLE}}(y)}\right) f(y)\, d\nu(y) - \int \log\left(\frac{f(y)}{g_{\widehat{\vartheta}^{\rm MLE}}(y)}\right) f(y)\, d\nu(y) \\
&= \int \log\big(g_{\widehat{\vartheta}^{\rm MLE}}(y)\big) f(y)\, d\nu(y) - \int \log\big(h_{\widehat{\theta}^{\rm MLE}}(y)\big) f(y)\, d\nu(y).
\end{aligned} \tag{4.38}
$$

If this difference is negative, model $h_{\widehat{\theta}^{\rm MLE}}$ should be preferred over model $g_{\widehat{\vartheta}^{\rm MLE}}$
because it is closer to the true model $f$ w.r.t. the KL divergence. Thus, we need to
calculate the two integrals in (4.38). Since the true density $f$ is not known, these
two integrals need to be estimated.

As a first idea we estimate the integrals on the right-hand side empirically using
the observations $\boldsymbol{Y}_n$, say, the first integral is estimated by

$$
\frac{1}{n} \sum_{i=1}^n \log\big(g_{\widehat{\vartheta}^{\rm MLE}}(Y_i)\big).
$$

However, this will lead to a biased estimate because the MLE $\widehat{\vartheta}^{\rm MLE}$ exactly
maximizes this empirical estimate (as a function of $\vartheta$). The integrals in (4.38),
on the other hand, can be interpreted as an out-of-sample calculation between
independent random variables $\boldsymbol{Y}_n$ (used for MLE) and $Y \sim f\,d\nu$ used in the integral.
The bias results from the fact that in the empirical estimate the independence
gets lost. Therefore, we need to correct this estimate for the bias in order to
obtain a reasonable estimate for the difference of the KL divergences. Under the
following assumptions this bias correction is asymptotically given by $-\dim(\vartheta)/n$:
(1) $\sqrt{n}\,\big(\widehat{\vartheta}^{\rm MLE}(\boldsymbol{Y}_n) - \vartheta_0\big)$ is asymptotically normally distributed $\mathcal{N}(0, \Sigma(\vartheta_0)^{-1})$ as
$n \to \infty$, where $\vartheta_0$ is the parameter that minimizes the KL divergence from $g_\vartheta$ to
$f$; we also refer to Remarks 3.26. (2) The true $f$ is sufficiently close to $g_{\vartheta_0}$, such
that the $E_f$-covariance matrix of the score $\nabla_\vartheta \log g_{\vartheta_0}$ is close to the negative $E_f$-expected
Hessian $\nabla^2_\vartheta \log g_{\vartheta_0}$; see also (3.36) and Sect. 11.1.4, below. In that case,
$\Sigma(\vartheta_0)$ approximately corresponds to Fisher’s information matrix $\mathcal{I}_1(\vartheta_0)$ and AIC is
justified.

This shows that AIC applies if both models are evaluated under the same
observations $\boldsymbol{Y}_n$, the models use the MLEs, and asymptotic normality
holds with limits such that the true model is close to a member of the selected
model classes $\{h_\theta;\ \theta\}$ and $\{g_\vartheta;\ \vartheta\}$. We remark that this is not the only set-up under
which AIC can be justified, but other set-ups do not essentially differ.

The Bayesian information criterion (BIC) is similar to AIC but in a Bayesian
context. The BIC says that model $h_{\widehat{\theta}^{\rm MLE}}$ should be preferred over model $g_{\widehat{\vartheta}^{\rm MLE}}$ if

$$
-2 \sum_{i=1}^n \log\big(h_{\widehat{\theta}^{\rm MLE}}(Y_i)\big) + \log(n) \dim(\theta) \ < \ -2 \sum_{i=1}^n \log\big(g_{\widehat{\vartheta}^{\rm MLE}}(Y_i)\big) + \log(n) \dim(\vartheta),
$$

where $n$ is the sample size of $\boldsymbol{Y}_n$ used for model fitting. The BIC has been derived
by Schwarz [331]. Therefore, it is also called Schwarz’ information criterion (SIC).
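
The two criteria can be illustrated by a small Python sketch that compares two Gaussian candidate models on the same simulated data. The two candidate models, the sample size and the seed are assumptions of this illustration only; note that the full log-likelihoods (including all constants) are used, as required in Remarks 4.28.

```python
import math
import random


def gauss_loglik(y, mean, var):
    """Full Gaussian log-likelihood, including all constants."""
    n = len(y)
    return -0.5 * n * math.log(2 * math.pi * var) - sum((yi - mean) ** 2 for yi in y) / (2 * var)


def aic(loglik, dim):
    return -2.0 * loglik + 2.0 * dim


def bic(loglik, dim, n):
    return -2.0 * loglik + math.log(n) * dim


rng = random.Random(7)
n = 200
y = [rng.gauss(0.5, 3.0) for _ in range(n)]            # synthetic data, true variance 9

mean_hat = sum(y) / n
var_hat = sum((yi - mean_hat) ** 2 for yi in y) / n    # MLE of the variance (divisor n)

# Model h: N(theta, 1) with one parameter; model g: N(m, s^2) with two parameters.
ll_h, dim_h = gauss_loglik(y, mean_hat, 1.0), 1
ll_g, dim_g = gauss_loglik(y, mean_hat, var_hat), 2

aic_h, aic_g = aic(ll_h, dim_h), aic(ll_g, dim_g)
bic_h, bic_g = bic(ll_h, dim_h, n), bic(ll_g, dim_g, n)
preferred = "h" if aic_h < aic_g else "g"
print(aic_h, aic_g, bic_h, bic_g, preferred)
```

Here the larger model is preferred by both criteria because the fixed unit variance of the smaller model fits the simulated data poorly; for $n \ge 8$ the BIC penalty $\log(n)$ per parameter exceeds the AIC penalty of 2.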


4.3 Bootstrap 


The bootstrap method was introduced by Efron [115] and Efron-Tibshirani [118].
The bootstrap is used to simulate new data from either the empirical distribution $\widehat{F}_n$
or from an estimated model $F(\cdot\,; \widehat{\theta})$. This allows us, for instance, to evaluate the outer
expectation in the expected deviance GL (4.32), which requires a data model for $\boldsymbol{Y}_n$.
The presentation in this section is based on the lecture notes of Bühlmann-Mächler
[59, Chapter 5].


4.3.1 Non-parametric Bootstrap Simulation 


Assume we have i.i.d. observations $Y_1, \ldots, Y_n$ from an unknown distribution
function $F(\cdot\,; \theta)$. Based on these observations $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$ we choose a
decision rule $A : \mathcal{Y} \to \mathbb{A} = \Theta \subset \mathbb{R}$ which provides us with an estimator for $\theta$

$$
\boldsymbol{Y} \mapsto \widehat{\theta} = A(\boldsymbol{Y}). \tag{4.39}
$$

Typically, the decision rule $A(\cdot)$ is a known function and we would like to determine
the distributional properties of parameter estimator (4.39) as a function of the
(random) observations $\boldsymbol{Y}$. E.g., for any measurable set $C$, we might want to compute

$$
P_\theta\big[\widehat{\theta} \in C\big] = P_\theta\big[A(\boldsymbol{Y}) \in C\big] = \int \mathbf{1}_{\{A(\boldsymbol{y}) \in C\}}\, dP(\boldsymbol{y}; \theta). \tag{4.40}
$$

Since, typically, the true data generating distribution $Y_i \sim F(\cdot\,; \theta)$ is not known, the
distributional properties of $\widehat{\theta}$ cannot be determined, not even by Monte Carlo simulation.
The idea behind the bootstrap is to approximate $F(\cdot\,; \theta)$. Choose as approximation
to $F(\cdot\,; \theta)$ the empirical distribution of the i.i.d. observations $\boldsymbol{Y}$ given by, see (3.9),

$$
\widehat{F}_n(y) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{Y_i \le y\}} \qquad \text{for } y \in \mathbb{R}.
$$

The Glivenko-Cantelli theorem [64, 159] tells us that the empirical distribution
$\widehat{F}_n$ converges uniformly to $F(\cdot\,; \theta)$, a.s., for $n \to \infty$, so it should be a good
approximation to $F(\cdot\,; \theta)$ for large $n$. The idea now is to simulate from the empirical
distribution $\widehat{F}_n$.




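(Non-parametric) bootstrap algorithm


(1) Repeat for $m = 1, \ldots, M$

    (a) simulate i.i.d. observations $Y_1^*, \ldots, Y_n^*$ from $\widehat{F}_n$ (these are obtained by
        random drawings with replacement from the observations $Y_1, \ldots, Y_n$; we
        denote this resampling distribution of $\boldsymbol{Y}^* = (Y_1^*, \ldots, Y_n^*)$ by $P^* = P^*_{\boldsymbol{Y}}$);

    (b) calculate the estimator $\widehat{\theta}^{(m*)} = A(\boldsymbol{Y}^*)$.

(2) Return $\widehat{\theta}^{(1*)}, \ldots, \widehat{\theta}^{(M*)}$ and the resulting empirical bootstrap distribution

$$
\widehat{F}^*_M(\vartheta) = \frac{1}{M} \sum_{m=1}^M \mathbf{1}_{\{\widehat{\theta}^{(m*)} \le \vartheta\}},
$$

for the estimated distribution of $\widehat{\theta}$.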
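
The steps of the algorithm above can be sketched in Python as follows. The observations are simulated, the sample mean plays the role of the decision rule $A$, and the names `bootstrap` and `empirical_cdf` as well as the choices of $n$, $M$ and the seeds are assumptions of this sketch.

```python
import random
import statistics


def bootstrap(y, estimator, M, seed=0):
    """Return the M bootstrap estimates A(Y*), with Y* drawn with replacement from y."""
    rng = random.Random(seed)
    n = len(y)
    return [estimator(rng.choices(y, k=n)) for _ in range(M)]


def empirical_cdf(estimates):
    """Empirical bootstrap distribution function: share of estimates <= t."""
    ordered = sorted(estimates)
    return lambda t: sum(1 for e in ordered if e <= t) / len(ordered)


rng = random.Random(1)
y = [rng.expovariate(0.5) for _ in range(100)]     # synthetic observations
theta_star = bootstrap(y, statistics.fmean, M=1000)

F_star = empirical_cdf(theta_star)
boot_mean = statistics.fmean(theta_star)
boot_var = statistics.variance(theta_star)         # divisor M - 1
print(boot_mean, boot_var, F_star(statistics.fmean(y)))
```

For the sample mean the variance under the resampling distribution is known in closed form (the plug-in variance of the observations divided by $n$), which gives a simple plausibility check of the simulated bootstrap variance.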


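We can use the empirical bootstrap distribution $\widehat{F}^*_M$ as an estimate of the true
distribution of $\widehat{\theta}$, that is, we estimate and approximate

$$
P_\theta\big[\widehat{\theta} \in C\big] \approx P^*_{\boldsymbol{Y}}\big[\widehat{\theta}^* \in C\big] \approx \frac{1}{M} \sum_{m=1}^M \mathbf{1}_{\{\widehat{\theta}^{(m*)} \in C\}}, \tag{4.41}
$$

where $P^*_{\boldsymbol{Y}}$ corresponds to the bootstrap distribution of Step (1a) of the above
algorithm, and where we set $\widehat{\theta}^* = A(\boldsymbol{Y}^*)$. This bootstrap distribution $P^*_{\boldsymbol{Y}}$ is
empirically approximated by the empirical bootstrap distribution $\widehat{F}^*_M$ for studying
$\widehat{\theta}$.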


Remarks 4.29 


• The quality of the approximations in (4.41) depends on the richness of the
  observation $\boldsymbol{Y} = (Y_1, \ldots, Y_n)$, because the bootstrap distribution

  $$
  P^*_{\boldsymbol{Y}}\big[\widehat{\theta}^* \in C\big] = P^*_{\boldsymbol{Y} = \boldsymbol{y}}\big[\widehat{\theta}^* \in C\big],
  $$

  depends on the realization $\boldsymbol{y}$ of the data $\boldsymbol{Y}$ from which we generate the bootstrap
  sample $\boldsymbol{Y}^*$. It also depends on $M$ and the explicit random drawings $\boldsymbol{Y}^*$ providing
  the empirical bootstrap distribution $\widehat{F}^*_M$. The latter uncertainty can be controlled
  since the bootstrap distribution $P^*_{\boldsymbol{Y}}$ corresponds to a multinomial distribution, and
  the Glivenko-Cantelli theorem [64, 159] applies to $\widehat{F}^*_M$ and $P^*_{\boldsymbol{Y}}$ for $M \to \infty$. The
  former uncertainty inherited from the realization $\boldsymbol{Y} = \boldsymbol{y}$ cannot be diminished
  because we cannot enrich the observation $\boldsymbol{Y}$.
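• The empirical bootstrap distribution $\widehat{F}^*_M$ can be used to estimate the mean of the
  estimator $\widehat{\theta}$ given in (4.39)

  $$
  E_\theta\big[\widehat{\theta}\big] \approx E_{P^*_{\boldsymbol{Y}}}\big[\widehat{\theta}^*\big] \approx \frac{1}{M} \sum_{m=1}^M \widehat{\theta}^{(m*)},
  $$

  and its variance

  $$
  \mathrm{Var}_\theta\big(\widehat{\theta}\big) \approx \mathrm{Var}_{P^*_{\boldsymbol{Y}}}\big(\widehat{\theta}^*\big) \approx \frac{1}{M-1} \sum_{m=1}^M \left( \widehat{\theta}^{(m*)} - \frac{1}{M} \sum_{k=1}^M \widehat{\theta}^{(k*)} \right)^2.
  $$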

• The previous item discusses the approximation of the bootstrap mean and
  variance, respectively. Bootstrap intervals for coverage ratios need some care,
  and there are different versions. The naive way of just calculating quantiles from
  $\widehat{F}^*_M$ often does not work well, and methods like a double bootstrap may need to
  be considered.

• In (4.39) we have assumed that the quantity of interest is the parameter $\theta$, but
  similar considerations also apply to general decision rules estimating $\gamma(\theta)$.

• The bootstrap as defined above directly acts on the observations $Y_1, \ldots, Y_n$, and
  the basic assumption is that these observations are i.i.d. If this is not the case,
  one may first need to transform the observations, for instance, one can calculate
  residuals and assume that these residuals are i.i.d. In more complicated cases, one
  even drops the i.i.d. assumption and replaces it by an identical mean and variance
  assumption, that is, all residuals are assumed to be independent, centered and
  with unit variance. This is sometimes also called residual bootstrap and it may
  be suitable in regression models as will be introduced below. Thus, in this latter
  case we estimate for each observation $Y_i$ its mean $\widehat{\mu}_i$ and its standard deviation
  $\widehat{\sigma}_i$, for instance, using the variance function of the chosen EDF. This then allows
  for calculating the residuals $\widehat{\varepsilon}_i = (Y_i - \widehat{\mu}_i)/\widehat{\sigma}_i$. For the residual bootstrap we
  resample the residuals $\varepsilon_i^*$ from $\widehat{\varepsilon}_1, \ldots, \widehat{\varepsilon}_n$. This provides bootstrap observations

  $$
  Y_i^* = \widehat{\mu}_i + \widehat{\sigma}_i\, \varepsilon_i^*.
  $$

  The wild bootstrap proposed by Wu [386] additionally uses a centered and
  normalized i.i.d. random variable $V_i$ (also being independent of $\varepsilon_i^*$) to modify
  the residual bootstrap observations to

  $$
  Y_i^* = \widehat{\mu}_i + \widehat{\sigma}_i\, V_i\, \varepsilon_i^*.
  $$
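
The residual bootstrap and the wild bootstrap of the last remark can be sketched as follows. The observations, the fitted means and the standard deviations are made-up numbers, and the use of Rademacher signs $V_i = \pm 1$ (which are centered with unit variance) is only one possible choice, assumed here for illustration.

```python
import math
import random


def standardized_residuals(y, mu_hat, sigma_hat):
    """Residuals eps_i = (Y_i - mu_i) / sigma_i."""
    return [(yi - m) / s for yi, m, s in zip(y, mu_hat, sigma_hat)]


def residual_bootstrap(y, mu_hat, sigma_hat, rng):
    """Y_i* = mu_i + sigma_i * eps_i*, with eps_i* resampled from the residuals."""
    eps = standardized_residuals(y, mu_hat, sigma_hat)
    eps_star = rng.choices(eps, k=len(eps))
    return [m + s * e for m, s, e in zip(mu_hat, sigma_hat, eps_star)]


def wild_bootstrap(y, mu_hat, sigma_hat, rng):
    """Y_i* = mu_i + sigma_i * V_i * eps_i*, with V_i = +/-1 (assumed Rademacher signs)."""
    eps = standardized_residuals(y, mu_hat, sigma_hat)
    eps_star = rng.choices(eps, k=len(eps))
    signs = [rng.choice((-1.0, 1.0)) for _ in eps]
    return [m + s * v * e for m, s, v, e in zip(mu_hat, sigma_hat, signs, eps_star)]


# Hypothetical observations with fitted means; standard deviations from V(mu) = mu.
y = [1.2, 0.4, 3.1, 2.2, 5.0]
mu_hat = [1.0, 1.0, 2.5, 2.5, 4.0]
sigma_hat = [math.sqrt(m) for m in mu_hat]

rng = random.Random(0)
y_res = residual_bootstrap(y, mu_hat, sigma_hat, rng)
y_wild = wild_bootstrap(y, mu_hat, sigma_hat, rng)
print(y_res)
print(y_wild)
```

In both versions the estimated means and standard deviations of the individual observations are kept, and only the standardized residuals are resampled.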
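The bootstrap is called consistent for $\widehat{\theta}$ if we have for all $z \in \mathbb{R}$ the following
convergence in probability as $n \to \infty$

$$
P_\theta\big[\sqrt{n}\,\big(\widehat{\theta} - \theta\big) \le z\big] - P^*_{\boldsymbol{Y}}\big[\sqrt{n}\,\big(\widehat{\theta}^* - \widehat{\theta}\big) \le z\big] \ \overset{\rm prob.}{\longrightarrow} \ 0,
$$

the quantities $\widehat{\theta} = \widehat{\theta}_n$ and $\widehat{\theta}^* = \widehat{\theta}^*_n$ depend on (the size $n$ of) the observation $\boldsymbol{Y} =
\boldsymbol{Y}_n$; the convergence in probability is needed because $\boldsymbol{Y} = \boldsymbol{Y}_n$ are random vectors.


Assume that $\widehat{\theta}^{\rm MLE} = \widehat{\theta}$ is the MLE of $\theta$ satisfying the assumptions of Theorem 3.28.
Then we have asymptotic normality, see (3.30),

$$
\sqrt{n}\,\big(\widehat{\theta} - \theta\big) \ \Rightarrow \ \mathcal{N}\big(0, \mathcal{I}_1(\theta)^{-1}\big) \qquad \text{as } n \to \infty,
$$

with Fisher’s information $\mathcal{I}_1(\theta)$. Bootstrap consistency then requires

$$
\sqrt{n}\,\big(\widehat{\theta}^* - \widehat{\theta}\big) \ \overset{P^*_{\boldsymbol{Y}}}{\Longrightarrow} \ \mathcal{N}\big(0, \mathcal{I}_1(\theta)^{-1}\big) \qquad \text{in probability as } n \to \infty.
$$

Bootstrap consistency typically holds if $\widehat{\theta}$ is asymptotically normal (as $n \to \infty$) and
if the underlying data $Y_i$ is i.i.d. Moreover, bootstrap consistency usually implies
consistent variance and bias estimation

$$
\frac{\mathrm{Var}_{P^*_{\boldsymbol{Y}}}\big(\widehat{\theta}^*\big)}{\mathrm{Var}_\theta\big(\widehat{\theta}\big)} \ \overset{\rm prob.}{\longrightarrow} \ 1
\qquad \text{and} \qquad
\frac{E_{P^*_{\boldsymbol{Y}}}\big[\widehat{\theta}^*\big] - \widehat{\theta}}{E_\theta\big[\widehat{\theta}\big] - \theta} \ \overset{\rm prob.}{\longrightarrow} \ 1
\qquad \text{as } n \to \infty.
$$

For more information and bootstrap confidence intervals we refer to Chapter 5 in
the lecture notes of Bühlmann-Mächler [59].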


4.3.2 Parametric Bootstrap Simulation 


For the parametric bootstrap we assume that we know the parametric family $\mathcal{F} =
\{F(\cdot\,; \theta);\ \theta \in \Theta\}$ from which the i.i.d. observations $Y_1, \ldots, Y_n \sim F(\cdot\,; \theta)$ have
been generated, and only the explicit choice of the parameter $\theta \in \Theta$ is not
known. Based on these observations we construct an estimator $\widehat{\theta} = A(\boldsymbol{Y})$ for the
unknown parameter $\theta \in \Theta$.


(Parametric) bootstrap algorithm


(1) Repeat for $m = 1, \ldots, M$

    (a) simulate i.i.d. observations $Y_1^*, \ldots, Y_n^*$ from $F(\cdot\,; \widehat{\theta})$ (we denote the
        resampling distribution of $\boldsymbol{Y}^* = (Y_1^*, \ldots, Y_n^*)$ by $P^* = P^*_{\widehat{\theta}}$);

    (b) calculate the estimator $\widehat{\theta}^{(m*)} = A(\boldsymbol{Y}^*)$.

(2) Return $\widehat{\theta}^{(1*)}, \ldots, \widehat{\theta}^{(M*)}$ and the resulting empirical bootstrap distribution

$$
\widehat{F}^*_M(\vartheta) = \frac{1}{M} \sum_{m=1}^M \mathbf{1}_{\{\widehat{\theta}^{(m*)} \le \vartheta\}}.
$$


We then estimate and approximate the distribution of $\widehat{\theta}$ analogously to (4.41),
and the same remarks apply as for the non-parametric bootstrap. The parametric
bootstrap has the advantage that it can enrich the data by sampling new observations
from the distribution $F(\cdot\,; \widehat{\theta})$. A shortfall of the parametric bootstrap occurs if the
family $\mathcal{F}$ is misspecified: then the bootstrap sample $\boldsymbol{Y}^*$ will only poorly describe the
true data $\boldsymbol{Y}$, e.g., if the data shows over-dispersion but the selected family $\mathcal{F}$ does not
allow for modeling such over-dispersion.
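
A minimal Python sketch of the parametric bootstrap is given next. As an illustrative assumption, the parametric family is chosen to be the exponential distribution with rate $\theta$, whose MLE is the reciprocal of the sample mean; the data, the seeds and the function names are made up for this example.

```python
import random
import statistics


def mle_rate(y):
    """MLE of the rate of an exponential distribution: 1 / sample mean."""
    return len(y) / sum(y)


def parametric_bootstrap(y, M, seed=0):
    """Return the MLE and M bootstrap estimates based on samples from F(.; theta_hat)."""
    rng = random.Random(seed)
    theta_hat = mle_rate(y)
    n = len(y)
    estimates = []
    for _ in range(M):
        y_star = [rng.expovariate(theta_hat) for _ in range(n)]   # new data from the fitted model
        estimates.append(mle_rate(y_star))
    return theta_hat, estimates


rng = random.Random(3)
y = [rng.expovariate(2.0) for _ in range(200)]     # synthetic observations
theta_hat, theta_star = parametric_bootstrap(y, M=500)
boot_sd = statistics.stdev(theta_star)
print(theta_hat, statistics.fmean(theta_star), boot_sd)
```

In contrast to the non-parametric version, the simulated values $Y_i^*$ are not restricted to the observed values, but the result is only meaningful if the chosen family describes the data reasonably well.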
Chapter 5
Generalized Linear Models


Most of the theory in the previous chapters has been based on the assumption of 
having similarity (or homogeneity) between the different observations. This was 
expressed by making an i.i.d. assumption on the observations, see, e.g., Sect. 3.3.2. 
In many practical applications such a homogeneity assumption is not reasonable, 
one may for example think of car insurance pricing where different car drivers have 
different driving experience and they drive different cars, or of health insurance 
where policyholders may have different genders and ages. Figure 5.1 shows a 
health insurance example where the claim sizes depend on the gender and the 
age of the policyholders. The most popular statistical models that are able to 
cope with such heterogeneous data are the generalized linear models (GLMs). The 
notion of GLMs has been introduced in the seminal work of Nelder-Wedderburn 
[283] in 1972. Their work has introduced a unified procedure for modeling and 
fitting distributions within the EDF to data having systematic differences (effects) 
that can be described by explanatory variables. Today, GLMs are the state-of-the- 
art statistical models in many applied fields including statistics, actuarial science 
and economics. However, the specific use of GLMs in the different fields may 
substantially differ. In fields like actuarial science these models are mainly used for 
predictive modeling, in other fields like economics or social sciences GLMs have 
become the main tool in exploring and explaining (hopefully) causal relations. For 
a discussion on “predicting” versus “explaining” we refer to Shmueli [338]. 

It is difficult to give a good list of references for GLMs, since GLMs and their
offspring are present in almost every statistical modeling publication and in every
lecture on statistics. Classical statistical references are the books of McCullagh-Nelder
[265], Fahrmeir-Tutz [123] and Dobson [107]; in the actuarial literature we
mention the textbooks (in alphabetical order) of Charpentier [67], De Jong-Heller
[89], Denuit et al. [99-101], Frees [134] and Ohlsson-Johansson [290], but this list
is far from being complete.
Fig. 5.1 Claim sizes in health insurance as a function of the age of the policyholder,
and split by gender

In this chapter we introduce and discuss GLMs in the context of actuarial
modeling. We do this in such a way that GLMs can be seen as a building block of
network regression models which will be the main topic of Chap. 7 on deep learning.


5.1 Generalized Linear Models and Log-Likelihoods 


5.1.1 Regression Modeling 


We start by assuming that we have independent random variables $Y_1, \ldots, Y_n$ which
are described by a fixed member of the EDF. That is, we assume that all $Y_i$ are
independent and have densities w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$ given by

$$
Y_i \sim f(y_i; \theta_i, v_i/\varphi) = \exp\left\{ \frac{y_i \theta_i - \kappa(\theta_i)}{\varphi / v_i} + a(y_i; v_i/\varphi) \right\} \qquad \text{for } 1 \le i \le n, \tag{5.1}
$$

with canonical parameters $\theta_i \in \mathring{\Theta}$, exposures $v_i > 0$ and dispersion parameter $\varphi > 0$.
Throughout, we assume that the effective domain $\Theta$ has a non-empty interior.
There is a fundamental difference between (5.1) and Example 3.5. We now allow
every random variable $Y_i$ to have its own canonical parameter $\theta_i \in \mathring{\Theta}$. We call
this a heterogeneous situation because the observations are allowed to differ in a
systematic way expressed by different canonical parameters. This is highlighted by
the lines in the health insurance example of Fig. 5.1 where (expected) claim sizes
differ by gender and age of policyholder.

In Sect. 4.1.2 we have introduced the saturated model where every observation $Y_i$
has its own parameter $\theta_i$. In general, if we have $n$ observations $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top$
we can estimate at most $n$ parameters. The other extreme case is the homogeneous
one, meaning that $\theta_i \equiv \theta \in \mathring{\Theta}$ for all $1 \le i \le n$. In this latter case we have exactly
one parameter to estimate, and we call this model the null model, intercept model
or homogeneous model, because all components of $\boldsymbol{Y}$ are assumed to follow the
same law expressed in a single common parameter $\theta$. Both the saturated model and
the null model may behave very poorly in predicting new observations. Typically,
the saturated model fully reflects the data $\boldsymbol{Y}$ including the noisy part (random
component, irreducible risk, see Remarks 4.2) and, therefore, it is not useful for
prediction. We also say that this model (in-sample) over-fits to the data $\boldsymbol{Y}$ and
does not generalize (out-of-sample) to new data. The null model often has a poor
predictive performance because if the data has systematic effects these cannot be
captured by a null model. GLMs try to find a good balance between these two
extreme cases, by trying to extract (only) the systematic effects from noisy data
$\boldsymbol{Y}$. We therefore model the canonical parameters $\theta_i$ as a low-dimensional function
of explanatory variables which capture the systematic effects in the data. In Fig. 5.1
gender and age of policyholder play the role of such explanatory variables.

Assume that each observation $Y_i$ is equipped with a feature (explanatory variable,
covariate) $\boldsymbol{x}_i$ that belongs to a fixed given feature space $\mathcal{X}$. These features $\boldsymbol{x}_i$
are assumed to describe the systematic effects in the observations $Y_i$, i.e., these
features are assumed to be appropriate descriptions of the heterogeneity between the
observations. In a nutshell, we then assume that we have a suitable regression function

$$
\theta : \mathcal{X} \to \mathring{\Theta}, \qquad \boldsymbol{x} \mapsto \theta(\boldsymbol{x}),
$$

such that we can appropriately describe the observations by

$$
Y_i \ \overset{\rm ind.}{\sim} \ f(y_i; \theta_i = \theta(\boldsymbol{x}_i), v_i/\varphi) = \exp\left\{ \frac{y_i \theta(\boldsymbol{x}_i) - \kappa(\theta(\boldsymbol{x}_i))}{\varphi / v_i} + a(y_i; v_i/\varphi) \right\}, \tag{5.2}
$$

for $1 \le i \le n$. As a result we obtain for the first moment of $Y_i$, see Corollary 2.14,

$$
\mu_i = \mu(\boldsymbol{x}_i) = E_{\theta(\boldsymbol{x}_i)}[Y_i] = \kappa'(\theta(\boldsymbol{x}_i)). \tag{5.3}
$$

Thus, the regression function $\theta : \mathcal{X} \to \mathring{\Theta}$ is assumed to describe the systematic
differences (effects) between the random variables $Y_1, \ldots, Y_n$ being expressed by
the means $\mu(\boldsymbol{x}_i)$ for features $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$. In GLMs this regression function takes a
linear form after a suitable transformation, which exactly motivates the terminology
generalized linear model.


5.1.2 Definition of Generalized Linear Models 


We start with the discussion of the features $\boldsymbol{x} \in \mathcal{X}$. Features are also called
explanatory variables, covariates, independent variables or regressors. Throughout,
we assume that the features $\boldsymbol{x} = (x_0, x_1, \ldots, x_q)^\top$ include a first component $x_0 = 1$,
and we choose feature space $\mathcal{X} \subset \{1\} \times \mathbb{R}^q$. The inclusion of this first component
$x_0 = 1$ is useful in what follows. We call this first component intercept or bias
component because it will be modeling an intercept of a regression model. The
null model (homogeneous model) has features that only consist of this intercept
component. For later purposes it will be useful to introduce the design matrix $\mathfrak{X}$
which collects the features $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n \in \mathcal{X}$ of all responses $Y_1, \ldots, Y_n$. The design
matrix is defined by

$$
\mathfrak{X} = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)^\top =
\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{1,q} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & \cdots & x_{n,q}
\end{pmatrix} \in \mathbb{R}^{n \times (q+1)}. \tag{5.4}
$$

Based on these choices we assume existence of a regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$
and of a strictly monotone and smooth link function $g : \mathcal{M} \to \mathbb{R}$ such that we can
express (5.3) by the following function (we drop index $i$)

$$
\boldsymbol{x} \mapsto g(\mu(\boldsymbol{x})) = g\big(E_{\theta(\boldsymbol{x})}[Y]\big) = \eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_j. \tag{5.5}
$$

Here, $\langle \cdot, \cdot \rangle$ describes the scalar product in the Euclidean space $\mathbb{R}^{q+1}$, $\theta(\boldsymbol{x}) =
h(\mu(\boldsymbol{x}))$ is the resulting canonical parameter (using canonical link $h = (\kappa')^{-1}$),
and $\eta(\boldsymbol{x})$ is the so-called linear predictor. After applying a suitable link function $g$,
the systematic effects of the random variable $Y$ with features $\boldsymbol{x}$ can be described by
a linear predictor $\eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle$, linear in the components of $\boldsymbol{x} \in \mathcal{X}$. This gives
a particular functional form to (5.3), and the random variables $Y_1, \ldots, Y_n$ share
a common regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. Note that the link function $g$ used
in (5.5) can be different from the canonical link $h$ used to calculate $\theta(\boldsymbol{x}) = h(\mu(\boldsymbol{x}))$.
We come back to this distinction below.


Summary of (5.5)


1. The independent random variables $Y_i$ follow a fixed member of the
   EDF (5.1) with individual canonical parameters $\theta_i \in \mathring{\Theta}$, for all $1 \le i \le n$.

2. The canonical parameters $\theta_i$ and the corresponding mean parameters $\mu_i$
   are related by the canonical link $h = (\kappa')^{-1}$ as follows $h(\mu_i) = \theta_i$, where
   $\kappa$ is the cumulant function of the chosen EDF, see Corollary 2.14.

3. We assume that the systematic effects in the random variables $Y_i$ can
   be described by linear predictors $\eta_i = \eta(\boldsymbol{x}_i) = \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$ and a strictly
   monotone and smooth link function $g$ such that we have $g(\mu_i) = \eta_i =
   \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$, for all $1 \le i \le n$, with common regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$.


We can either express this GLM regression structure in the dual (mean) parameter
space $\mathcal{M}$ or in the effective domain $\Theta$, see Remarks 2.9,

$$
\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = g^{-1}(\eta(\boldsymbol{x})) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle \in \mathcal{M}
\qquad \text{or} \qquad
\boldsymbol{x} \mapsto \theta(\boldsymbol{x}) = (h \circ g^{-1})(\eta(\boldsymbol{x})) = (h \circ g^{-1})\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle \in \Theta,
$$

where $(h \circ g^{-1})$ is the composition of the inverse link $g^{-1}$ and the canonical link $h$.
For the moment, the link function $g$ is quite general. In practice, the explicit choice
needs some care. The right-hand side of (5.5) is defined on the whole real line if at
least one component of $\boldsymbol{x}$ is unbounded in both directions. On the other hand, $\mathcal{M}$ and $\Theta$
may be bounded sets. Therefore, the link function $g$ may require some restrictions
such that the domain and the range fulfill the necessary constraints. The dimension
of $\boldsymbol{\beta}$ should satisfy $1 \le 1 + q \le n$; the lower bound will provide a null model and
the upper bound a saturated model.
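
The two representations can be made concrete with a small Python sketch. As an illustrative assumption, we take the gamma model with cumulant function $\kappa(\theta) = -\log(-\theta)$ on $\Theta = (-\infty, 0)$, so that $\mathcal{M} = (0, \infty)$ and the canonical link is $h(\mu) = -1/\mu$, and we combine it with the (non-canonical) log-link $g = \log$. The regression parameter and the feature values are made up.

```python
import math


def linear_predictor(beta, x):
    """eta(x) = <beta, x>, where x[0] = 1 is the intercept component."""
    return sum(b * xj for b, xj in zip(beta, x))


def gamma_glm_log_link(beta, x):
    """Return (mu, theta) for a gamma GLM with log-link g = log."""
    eta = linear_predictor(beta, x)
    mu = math.exp(eta)      # mu = g^{-1}(eta), always in M = (0, infinity)
    theta = -1.0 / mu       # theta = h(mu) with canonical link h(mu) = -1/mu
    return mu, theta


beta = [0.5, -0.2, 0.1]     # hypothetical regression parameter
x = [1.0, 2.0, 3.0]         # feature with intercept component x_0 = 1
mu, theta = gamma_glm_log_link(beta, x)
print(mu, theta)
```

With the log-link, any real value of the linear predictor is mapped to a valid mean in $\mathcal{M}$ and to a valid canonical parameter in $\Theta$; under the canonical link $g = h$ this would not be guaranteed in the gamma case because the linear predictor would have to stay negative.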


5.1.3 Link Functions and Feature Engineering 


As link function we choose a strictly monotone and smooth function $g : \mathcal{M} \to \mathbb{R}$
such that we do not have any conflicts in domains and ranges. Besides these
requirements, we may want further properties for the link function $g$ and the features
$\boldsymbol{x}$. From (5.5) we have

$$
\mu(\boldsymbol{x}) = E_{\theta(\boldsymbol{x})}[Y] = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle. \tag{5.6}
$$

Of course, a basic requirement is that the selected features $\boldsymbol{x}$ can appropriately
describe the mean of $Y$ by the function in (5.6), see also Fig. 5.1. This may
require so-called feature engineering of $\boldsymbol{x}$, for instance, we may want to replace
the first component $x_1$ of the raw features $\boldsymbol{x}$ by, say, $x_1^2$ in the pre-processed
features. For example, if this first component describes the age of the insurance
policyholder, then, in some regression problems, it might be more appropriate to
consider age$^2$ instead of age to bring the predictive problem into structure (5.6). It
may also be that we would like to enforce a certain type of interaction between the
components of the raw features. For instance, we may include in a pre-processed
feature a component $x_1/x_2^2$ which might correspond to weight/height$^2$ if the
policyholder has body weight $x_1$ and body height $x_2$. In fact, this pre-processed
feature is exactly the body mass index of the policyholder. We will come back to
feature engineering in Sect. 5.2.2, below.
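
A small Python sketch illustrates such pre-processing and the resulting design matrix (5.4). The raw feature values and the function names are hypothetical; the pre-processed features consist of the intercept component, age, age$^2$ and the body mass index.

```python
def preprocess(age, weight_kg, height_m):
    """Map raw features (age, weight, height) to pre-processed features with intercept."""
    return [1.0, age, age ** 2, weight_kg / height_m ** 2]


def design_matrix(raw_features):
    """Stack the pre-processed feature vectors row by row."""
    return [preprocess(*row) for row in raw_features]


# Hypothetical policyholders: (age in years, weight in kg, height in m).
raw = [(25, 70.0, 1.75), (40, 80.0, 1.80), (63, 65.0, 1.60)]
X = design_matrix(raw)
for row in X:
    print(row)
```

Each row starts with the intercept component 1, so the resulting matrix has $q + 1 = 4$ columns.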
Another important requirement is the ability of model interpretation. In insurance
pricing problems, one often prefers additive and multiplicative effects in feature
components. Choosing the identity link $g(m) = m$ we obtain a model with additive
effects

$$
\mu(\boldsymbol{x}) = E_{\theta(\boldsymbol{x})}[Y] = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_j,
$$

and choosing the log-link $g(m) = \log(m)$ we obtain a model with multiplicative
effects

$$
\mu(\boldsymbol{x}) = E_{\theta(\boldsymbol{x})}[Y] = \exp\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = e^{\beta_0} \prod_{j=1}^q e^{\beta_j x_j}.
$$

The latter is probably the most commonly used GLM in insurance pricing because
it leads to explainable tariffs where feature values directly relate to price decreases and
increases in percentages of a base premium $\exp\{\beta_0\}$.
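
The multiplicative structure under the log-link can be verified with a few lines of Python. The regression parameter (with base premium 100) and the feature values are hypothetical.

```python
import math


def mean_log_link(beta, x):
    """mu(x) = exp<beta, x> under the log-link."""
    return math.exp(sum(b * xj for b, xj in zip(beta, x)))


beta = [math.log(100.0), 0.10, -0.05]   # hypothetical: base premium exp(beta_0) = 100
x = [1.0, 2.0, 1.0]

mu = mean_log_link(beta, x)
# Same mean written as base premium times one factor per feature component.
product_form = math.exp(beta[0]) * math.prod(math.exp(b * xj) for b, xj in zip(beta[1:], x[1:]))

# Increasing the first feature component by one unit changes the price by exp(beta_1).
x_up = [1.0, 3.0, 1.0]
relative_change = mean_log_link(beta, x_up) / mu
print(mu, product_form, relative_change)
```

The relative change does not depend on the other feature values, which is what makes tariffs under the log-link easy to explain.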

Another very popular choice is the canonical (natural) link, i.e., $g = h = (\kappa')^{-1}$.
The canonical link substantially simplifies the analysis and it has very favorable
statistical properties (as we will see below). However, in some applications practical
needs overrule good statistical properties. Under the canonical link $g = h$ we have
in the dual mean parameter space $\mathcal{M}$ and in the effective domain $\Theta$, respectively,

$$
\boldsymbol{x} \mapsto \mu(\boldsymbol{x}) = \kappa'(\eta(\boldsymbol{x})) = \kappa'\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle
\qquad \text{and} \qquad
\boldsymbol{x} \mapsto \theta(\boldsymbol{x}) = \eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle.
$$

Thus, the linear predictor $\eta$ and the canonical parameter $\theta$ coincide under the
canonical link choice $g = h = (\kappa')^{-1}$.


5.1.4 Log-Likelihood Function and Maximum Likelihood 
Estimation 


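After having a fully specified GLM within the EDF, there remains estimation of the
regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$. This is done within the framework of MLE.


The log-likelihood function of $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top$ for regression parameter
$\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ is given by, see (5.2) and using the independence between the
$Y_i$’s,

$$
\boldsymbol{\beta} \mapsto \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum_{i=1}^n \frac{v_i}{\varphi} \Big[ Y_i\, h(\mu(\boldsymbol{x}_i)) - \kappa\big(h(\mu(\boldsymbol{x}_i))\big) \Big] + a(Y_i; v_i/\varphi), \tag{5.7}
$$

where we set $\mu(\boldsymbol{x}_i) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$. For the canonical link $g = h = (\kappa')^{-1}$
this simplifies to

$$
\boldsymbol{\beta} \mapsto \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum_{i=1}^n \frac{v_i}{\varphi} \Big[ Y_i \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle - \kappa\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle \Big] + a(Y_i; v_i/\varphi). \tag{5.8}
$$


MLE of $\boldsymbol{\beta}$ needs maximization of log-likelihoods (5.7) and (5.8), respectively;
these are the GLM counterparts to the homogeneous case treated in Section 3.3.2.
We calculate the score, setting $\eta_i = \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$ and $\mu_i = \mu(\boldsymbol{x}_i) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$,

$$
\begin{aligned}
s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta})
&= \sum_{i=1}^n \frac{v_i}{\varphi}\, [Y_i - \mu_i]\, \nabla_{\boldsymbol{\beta}}\, h(\mu(\boldsymbol{x}_i)) \\
&= \sum_{i=1}^n \frac{v_i}{\varphi}\, [Y_i - \mu_i]\, \frac{\partial h(\mu_i)}{\partial \mu_i}\, \frac{\partial \mu_i}{\partial \eta_i}\, \nabla_{\boldsymbol{\beta}}\, \eta(\boldsymbol{x}_i) \\
&= \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{Y_i - \mu_i}{V(\mu_i)} \left( \frac{\partial g(\mu_i)}{\partial \mu_i} \right)^{-1} \boldsymbol{x}_i,
\end{aligned} \tag{5.9}
$$

where we use the definition of the variance function $V(\mu) = (\kappa'' \circ h)(\mu)$, see
Corollary 2.14. We define the diagonal working weight matrix, which in general
depends on $\boldsymbol{\beta}$ through the means $\mu_i = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$,

$$
W(\boldsymbol{\beta}) = \mathrm{diag}\left( \left( \frac{\partial g(\mu_i)}{\partial \mu_i} \right)^{-2} \frac{v_i}{\varphi}\, \frac{1}{V(\mu_i)} \right)_{1 \le i \le n} \in \mathbb{R}^{n \times n},
$$

and the working residuals

$$
\boldsymbol{R} = \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}) = \left( \frac{\partial g(\mu_i)}{\partial \mu_i}\, (Y_i - \mu_i) \right)^\top_{1 \le i \le n} \in \mathbb{R}^n.
$$

This allows us to write the score equations in a compact form, which provides the
following proposition.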
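Proposition 5.1 The MLE for $\boldsymbol{\beta}$ is found by solving the score equations

$$
s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \mathfrak{X}^\top W(\boldsymbol{\beta})\, \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}) = 0.
$$

For the canonical link $g = h = (\kappa')^{-1}$ the score equations simplify to

$$
s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \mathfrak{X}^\top \mathrm{diag}\left( \frac{v_i}{\varphi} \right)_{1 \le i \le n} \big( \boldsymbol{Y} - \kappa'(\mathfrak{X}\boldsymbol{\beta}) \big) = 0,
$$

where $\kappa'(\mathfrak{X}\boldsymbol{\beta}) \in \mathbb{R}^n$ is understood element-wise.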
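
The two forms of the score in the proposition can be compared numerically. The following Python sketch does this for a Poisson frequency model with log-link (which is the canonical link in the Poisson case) and $\varphi = 1$, so that $\partial g(\mu)/\partial \mu = 1/\mu$ and $V(\mu) = \mu$. The data, the parameter value and the function names are made up for illustration.

```python
import math


def mu_of(beta, x):
    """mu(x) = exp<beta, x> for the log-link."""
    return math.exp(sum(b * xj for b, xj in zip(beta, x)))


def loglik(beta, X, Y, v):
    """Poisson log-likelihood for frequencies (phi = 1), dropping terms free of beta."""
    return sum(vi * (yi * math.log(mu_of(beta, x)) - mu_of(beta, x))
               for x, yi, vi in zip(X, Y, v))


def score_general(beta, X, Y, v):
    """Score in the form X^T W R with g = log, i.e. dg/dmu = 1/mu and V(mu) = mu."""
    s = [0.0] * len(beta)
    for x, yi, vi in zip(X, Y, v):
        mu = mu_of(beta, x)
        dg = 1.0 / mu
        w = vi / (dg ** 2 * mu)     # working weight W_ii (phi = 1)
        r = dg * (yi - mu)          # working residual R_i
        for j, xj in enumerate(x):
            s[j] += xj * w * r
    return s


def score_canonical(beta, X, Y, v):
    """Score in the canonical-link form X^T diag(v_i) (Y - mu)."""
    s = [0.0] * len(beta)
    for x, yi, vi in zip(X, Y, v):
        mu = mu_of(beta, x)
        for j, xj in enumerate(x):
            s[j] += xj * vi * (yi - mu)
    return s


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # design matrix with intercept
v = [1.0, 0.5, 2.0, 1.5]                               # exposures
Y = [0.0, 2.0, 0.5, 2.0 / 1.5]                         # observed frequencies N_i / v_i
beta = [-0.3, 0.2]

s_gen = score_general(beta, X, Y, v)
s_can = score_canonical(beta, X, Y, v)
print(s_gen, s_can)
```

Both functions return the same vector, and this vector agrees with a finite-difference gradient of the log-likelihood; the MLE is a parameter value at which this vector vanishes.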


Remarks 5.2 


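• In general, the MLE of $\boldsymbol{\beta}$ is not calculated by maximizing the log-likelihood
  function $\ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$, but rather by solving the score equations $s(\boldsymbol{\beta}, \boldsymbol{Y}) = 0$; we also
  refer to Remarks 3.29 on M- and Z-estimators. The score equations provide
  the critical points for $\boldsymbol{\beta}$, from which the global maximum of the log-likelihood
  function can be determined, supposing it exists.

• Existence of an MLE of $\boldsymbol{\beta}$ is not always given; similarly to Example 3.5, we may
  face the problem that the solution lies at the boundary of the parameter space
  (which itself may be an open set).

• If the log-likelihood function $\boldsymbol{\beta} \mapsto \ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$ is strictly concave, then the critical
  point of the score equations $s(\boldsymbol{\beta}, \boldsymbol{Y}) = 0$ is unique, supposing it exists, and,
  henceforth, we have a unique MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ for $\boldsymbol{\beta}$. Below, we give cases where
  the strict concavity of the log-likelihood holds.

• In general, there is no closed-form solution for the MLE of $\boldsymbol{\beta}$, except in the
  Gaussian case with canonical link; thus, we need to solve the score equations
  numerically.


Similarly to Remarks 3.17 we can calculate Fisher’s information matrix w.r.t. $\boldsymbol{\beta}$
through the negative expected Hessian of $\ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$.


We get Fisher’s information matrix w.r.t. $\boldsymbol{\beta}$

$$
\mathcal{I}(\boldsymbol{\beta}) = E_{\boldsymbol{\beta}}\Big[ \nabla_{\boldsymbol{\beta}}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) \big( \nabla_{\boldsymbol{\beta}}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) \big)^\top \Big] = -E_{\boldsymbol{\beta}}\Big[ \nabla^2_{\boldsymbol{\beta}}\, \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) \Big] = \mathfrak{X}^\top W(\boldsymbol{\beta})\, \mathfrak{X}. \tag{5.10}
$$

If the design matrix $\mathfrak{X} \in \mathbb{R}^{n \times (q+1)}$ has full rank $q + 1 \le n$, Fisher’s
information matrix $\mathcal{I}(\boldsymbol{\beta})$ is positive definite.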
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Dispersion parameter g > 0 has been treated as a nuisance parameter above. 
Its explicit specification does not influence the MLE of £ because it cancels in the 
score equations. If necessary, we can also estimate this dispersion parameter with 
MLE. This requires solving the additional score equation 


a Ton a 
ap O= 2 — Yin) = e (awe) | + as v/o) = 0. 


(5.11) 


and we can plug in the MLE of £ (which can be estimated independently of ø). 
Fisher’s information matrix is in this extended framework given by 


X'W(B)X 0 ) 
o ; 


I(B, p) = -Eg [ Maser B. | z ( —Eg [07¢y(B, v)/d97] 


that is, the off-diagonal terms between £ and ¢ are zero. 


In view of Proposition 5.1 we need a root search algorithm to obtain the MLE
of $\boldsymbol{\beta}$. Typically, one uses Fisher’s scoring method or the iterative re-weighted
least squares (IRLS) algorithm to solve this root search problem. This is a main
result derived in the seminal work of Nelder–Wedderburn [283] and it explains the
popularity of GLMs, namely, GLMs can be solved efficiently by this algorithm.
Fisher’s scoring method/IRLS algorithm explore the updates for $t \ge 0$ until
convergence

$$\widehat{\boldsymbol{\beta}}^{(t)} ~\mapsto~ \widehat{\boldsymbol{\beta}}^{(t+1)} = \Big( X^\top W(\widehat{\boldsymbol{\beta}}^{(t)}) X \Big)^{-1} X^\top W(\widehat{\boldsymbol{\beta}}^{(t)}) \Big( X \widehat{\boldsymbol{\beta}}^{(t)} + \boldsymbol{R}(\boldsymbol{Y}, \widehat{\boldsymbol{\beta}}^{(t)}) \Big), \tag{5.12}$$

where all terms on the right-hand side are evaluated at algorithmic time $t$. If we
have $n$ observations $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top$ we can estimate at most $n$ parameters.
Therefore, in our GLM we assume to have a regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ of
dimension $q + 1 \le n$. Moreover, we require that the design matrix $X$ has full rank
$q + 1 \le n$. Otherwise the regression parameter is not uniquely identifiable since
linear dependence in the columns of $X$ allows us to reduce the dimension of the
parameter space to a smaller representation. This is also needed to calculate the
inverse matrix in (5.12). This motivates the following assumption.




Assumption 5.3 Throughout, we assume that the design matrix $X \in \mathbb{R}^{n \times (q+1)}$
has full rank $q + 1 \le n$.
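The update (5.12) can be sketched in a few lines of plain Python. The following is only a minimal illustration, assuming a Poisson GLM with log-link and dispersion $\varphi = 1$; for this choice $\partial g(\mu)/\partial\mu = 1/\mu$ and $V(\mu) = \mu$, so the working weights are $v_i \mu_i$ and the working residuals are $(Y_i - \mu_i)/\mu_i$. The helper names `solve_linear` and `irls_poisson_log` and the toy data are hypothetical and not taken from the text.

```python
import math

def solve_linear(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def irls_poisson_log(X, y, v, n_iter=50, tol=1e-10):
    """Iterate update (5.12) for a Poisson GLM with log-link (assumed setup).

    y[i] are claim frequencies with volumes v[i]; the dispersion is 1.
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(n_iter):
        eta = [sum(X[i][j] * beta[j] for j in range(d)) for i in range(n)]
        mu = [math.exp(e) for e in eta]
        w = [v[i] * mu[i] for i in range(n)]                    # working weights
        z = [eta[i] + (y[i] - mu[i]) / mu[i] for i in range(n)]  # X beta + R
        xtwx = [[sum(X[i][j] * w[i] * X[i][k] for i in range(n))
                 for k in range(d)] for j in range(d)]
        xtwz = [sum(X[i][j] * w[i] * z[i] for i in range(n)) for j in range(d)]
        new_beta = solve_linear(xtwx, xtwz)
        step = max(abs(new_beta[j] - beta[j]) for j in range(d))
        beta = new_beta
        if step < tol:
            break
    return beta

# Toy data (illustrative only): intercept plus one continuous feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.10, 0.20, 0.15, 0.40]
v = [10.0, 8.0, 12.0, 5.0]

beta = irls_poisson_log(X, y, v)
mu_hat = [math.exp(sum(xj * bj for xj, bj in zip(row, beta))) for row in X]
# Score under the canonical link: X^T diag(v_i) (Y - mu), see Proposition 5.1.
score = [sum(X[i][j] * v[i] * (y[i] - mu_hat[i]) for i in range(len(y)))
         for j in range(2)]
```

At convergence the entries of `score` are numerically zero, in line with the score equations of Proposition 5.1. The full rank requirement of Assumption 5.3 is what keeps the linear system in each iteration solvable.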


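Remarks 5.4 (Justification of Fisher’s Scoring Method/IRLS Algorithm)

• We give a short justification of Fisher’s scoring method/IRLS algorithm, for a
  more detailed treatment we refer to Section 2.5 in McCullagh–Nelder [265] and
  Section 2.2 in Fahrmeir–Tutz [123].

  The Newton–Raphson algorithm provides a numerical scheme to find solutions
  to the score equations. It requires iterating for $t \ge 0$

  $$\widehat{\boldsymbol{\beta}}^{(t)} ~\mapsto~ \widehat{\boldsymbol{\beta}}^{(t+1)} = \widehat{\boldsymbol{\beta}}^{(t)} + \widehat{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\, s(\widehat{\boldsymbol{\beta}}^{(t)}, \boldsymbol{Y}),$$

  where $\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$ denotes the observed information matrix in $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$.
  The calculation of the inverse of the observed information matrix $\widehat{\mathcal{I}}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}$ can
  be time consuming and unstable because we need to calculate second derivatives
  and the eigenvalues of the observed information matrix can be close to zero. A
  stable scheme is obtained by replacing the observed information matrix $\widehat{\mathcal{I}}(\boldsymbol{\beta})$
  by Fisher’s information matrix $\mathcal{I}(\boldsymbol{\beta}) = \mathbb{E}_{\boldsymbol{\beta}}[\widehat{\mathcal{I}}(\boldsymbol{\beta})]$ being positive definite under
  Assumption 5.3; this provides a quasi-Newton method. Thus, for Fisher’s scoring
  method we iterate for $t \ge 0$

  $$\widehat{\boldsymbol{\beta}}^{(t)} ~\mapsto~ \widehat{\boldsymbol{\beta}}^{(t+1)} = \widehat{\boldsymbol{\beta}}^{(t)} + \mathcal{I}(\widehat{\boldsymbol{\beta}}^{(t)})^{-1}\, s(\widehat{\boldsymbol{\beta}}^{(t)}, \boldsymbol{Y}), \tag{5.13}$$

  and rewriting this provides us exactly with (5.12). The latter can also be
  interpreted as an IRLS scheme where the response $g(Y_i)$ is replaced by an
  adjusted linearized version $Z_i = g(\mu_i) + \frac{\partial g(\mu_i)}{\partial \mu_i}(Y_i - \mu_i)$. This corresponds
  to the last bracket in (5.12), and with corresponding weights.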

• Under the canonical link choice, Fisher’s information matrix and the observed
  information matrix coincide, i.e. $\mathcal{I}(\boldsymbol{\beta}) = \widehat{\mathcal{I}}(\boldsymbol{\beta})$, and the Newton–Raphson
  algorithm, Fisher’s scoring method and the IRLS algorithm are identical. This
  can easily be seen from Proposition 5.1. We receive under the canonical link
  choice

  $$\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = -X^\top \mathrm{diag}\left( \frac{v_i}{\varphi}\, V(\mu_i) \right)_{1 \le i \le n} X = -X^\top W(\boldsymbol{\beta}) X = -\mathcal{I}(\boldsymbol{\beta}). \tag{5.14}$$

  The full rank assumption $q + 1 \le n$ on the design matrix $X$ implies that
  Fisher’s information matrix $\mathcal{I}(\boldsymbol{\beta})$ is positive definite. This in turn implies that
  the log-likelihood function $\ell_{\boldsymbol{Y}}(\boldsymbol{\beta})$ is strictly concave, providing uniqueness of a
  critical point (provided it exists). This indicates that the canonical link has very
  favorable properties for MLE. Examples 5.5 and 5.6 give two examples not using
  the canonical link; the first one is a concave maximization problem, the second
  one is not for $p > 2$.


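Example 5.5 (Gamma Model with Log-Link) We study the gamma distribution as a
single-parameter EDF model, choosing the shape parameter $\alpha = 1/\varphi$ as the inverse
of the dispersion parameter, see Sect. 2.2.2. Cumulant function $\kappa(\theta) = -\log(-\theta)$
gives us the canonical link $\theta = h(\mu) = -1/\mu$. Moreover, we choose the log-link
$\eta = g(\mu) = \log(\mu)$ for the GLM. This gives a canonical parameter $\theta = -\exp\{-\eta\}$.
We receive the score

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum_{i=1}^n \frac{v_i}{\varphi} \left[ \frac{Y_i}{\mu_i} - 1 \right] \boldsymbol{x}_i = X^\top \mathrm{diag}\left( \frac{v_i}{\varphi} \right)_{1 \le i \le n} \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}).$$

Unlike in other examples with non-canonical links, we receive a favorable expression
here because only one term in the square bracket depends on the regression
parameter $\boldsymbol{\beta}$, or equivalently, the working weight matrix $W$ does not depend on
$\boldsymbol{\beta}$. We calculate the negative Hessian (observed information matrix)

$$\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = X^\top \mathrm{diag}\left( \frac{v_i}{\varphi}\, \frac{Y_i}{\mu_i} \right)_{1 \le i \le n} X.$$

In the gamma model all observations $Y_i$ are strictly positive, a.s., and under the
full rank assumption $q + 1 \le n$, the observed information matrix $\widehat{\mathcal{I}}(\boldsymbol{\beta})$ is positive
definite, thus, we have a strictly concave log-likelihood function in the gamma case
with log-link. ■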
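The statement that the working weights do not depend on $\boldsymbol{\beta}$ in the gamma case with log-link can be checked directly from the definition of the working weight matrix, whose diagonal entries are $(\partial g(\mu_i)/\partial\mu_i)^{-2}\, (v_i/\varphi) / V(\mu_i)$. The sketch below is only an illustration; the function names are hypothetical, and it uses the variance functions $V(\mu) = \mu^2$ (gamma) and $V(\mu) = \mu$ (Poisson, shown for contrast).

```python
def working_weight(mu, v, phi, dg_dmu, variance):
    """Diagonal entry of W(beta): (dg/dmu)^(-2) * (v / phi) / V(mu)."""
    return (v / phi) / (dg_dmu(mu) ** 2 * variance(mu))

def log_link_derivative(mu):
    """Derivative of the log-link g(mu) = log(mu)."""
    return 1.0 / mu

def gamma_variance(mu):
    """Variance function of the gamma model, V(mu) = mu^2."""
    return mu ** 2

def poisson_variance(mu):
    """Variance function of the Poisson model, V(mu) = mu."""
    return mu

# Gamma with log-link: the weight equals v / phi = 4 for every mean mu.
w_gamma_small = working_weight(0.3, 2.0, 0.5, log_link_derivative, gamma_variance)
w_gamma_large = working_weight(7.0, 2.0, 0.5, log_link_derivative, gamma_variance)

# Poisson with log-link: the weight (v / phi) * mu changes with the mean.
w_poisson = working_weight(7.0, 2.0, 0.5, log_link_derivative, poisson_variance)
```

Because the gamma weights are constant, each step of Fisher's scoring method for this model is a weighted least squares fit with fixed weights $v_i/\varphi$; only the working residuals change between iterations.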


Example 5.6 (Tweedie’s Models with Log-Link) We study Tweedie’s models for
power variance parameters $p > 1$ as a single-parameter EDF model, see Sect. 2.2.3.
The cumulant function $\kappa_p$ is given in Table 4.1. This gives us the canonical link $\theta =
h_p(\mu) = \mu^{1-p}/(1-p) < 0$ for $\mu > 0$ and $p > 1$. Moreover, we choose the log-link
$\eta = g(\mu) = \log(\mu)$ for the GLM. This implies $\theta = \exp\{(1-p)\eta\}/(1-p) < 0$
for $p > 1$. We receive the score

$$s(\boldsymbol{\beta}, \boldsymbol{Y}) = \nabla_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{Y_i - \mu_i}{\mu_i^{p-1}}\, \boldsymbol{x}_i = X^\top \mathrm{diag}\left( \frac{v_i}{\varphi}\, \frac{1}{\mu_i^{p-2}} \right)_{1 \le i \le n} \boldsymbol{R}(\boldsymbol{Y}, \boldsymbol{\beta}).$$

We calculate the negative Hessian (observed information matrix) for $\mu_i > 0$

$$\widehat{\mathcal{I}}(\boldsymbol{\beta}) = -\nabla^2_{\boldsymbol{\beta}} \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = X^\top \mathrm{diag}\left( \frac{v_i}{\varphi}\, \frac{(p-1) Y_i - (p-2) \mu_i}{\mu_i^{p-1}} \right)_{1 \le i \le n} X.$$

This matrix is positive definite for $p \in [1, 2]$, and for $p > 2$ it is not positive definite
because $(p-1) Y_i - (p-2) \mu_i$ may have positive or negative values if we vary $\mu_i > 0$
over its domain $\mathcal{M}$. Thus, we do not have concavity of the optimization problem
under the log-link choice in Tweedie’s GLMs for power variance parameters $p > 2$.
This in particular applies to the inverse Gaussian GLM with log-link. ■
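The sign change of the diagonal entries $(v_i/\varphi)\,[(p-1)Y_i - (p-2)\mu_i]/\mu_i^{p-1}$ is easy to see numerically. The following sketch simply evaluates this expression; the function name and the chosen values of $Y_i$, $\mu_i$ and $p$ are illustrative assumptions.

```python
def tweedie_log_link_weight(y, mu, p, v=1.0, phi=1.0):
    """Diagonal entry (v/phi) * [(p-1) y - (p-2) mu] / mu**(p-1)
    of the observed information under the log-link."""
    return (v / phi) * ((p - 1.0) * y - (p - 2.0) * mu) / mu ** (p - 1.0)

# p = 1.5 lies in [1, 2]: the entry stays positive even for a large mean.
entry_p15 = tweedie_log_link_weight(y=1.0, mu=3.0, p=1.5)

# p = 3 (inverse Gaussian): positive for mu = 1, negative once mu > 2 * y.
entry_p3_small_mu = tweedie_log_link_weight(y=1.0, mu=1.0, p=3.0)
entry_p3_large_mu = tweedie_log_link_weight(y=1.0, mu=3.0, p=3.0)
```

For $p > 2$ the entry turns negative as soon as $\mu_i > \frac{p-1}{p-2}\, Y_i$, which is why a numerical optimizer for such a model may need a starting value that is not too far from the observations.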


5.1.5 Balance Property Under the Canonical Link Choice 


Throughout this section we work under the canonical link choice $g = h = (\kappa')^{-1}$.
This choice has very favorable statistical properties. We have already seen in
Remarks 5.4 that the derivation of the MLE of $\boldsymbol{\beta}$ becomes particularly easy under
the canonical link choice and the observed information matrix $\widehat{\mathcal{I}}(\boldsymbol{\beta})$ coincides with
Fisher’s information matrix $\mathcal{I}(\boldsymbol{\beta})$ in this case, see (5.14).

For insurance pricing, canonical links have another very remarkable property, 
namely, that the estimated model automatically fulfills the balance property and, 
henceforth, is unbiased. This is particularly important in insurance pricing because 
it tells us that the insurance prices (over the entire portfolio) are on the right level. 
We have already met the balance property in Corollary 3.19. 


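Corollary 5.7 (Balance Property) Assume that $\boldsymbol{Y}$ has independent components
being modeled by a GLM under the canonical link choice $g = h =
(\kappa')^{-1}$. Assume that the MLE of regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ exists and
denote it by $\widehat{\boldsymbol{\beta}}^{\rm MLE}$. We have balance property on portfolio level (for constant
dispersion $\varphi$)

$$\sum_{i=1}^n \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\rm MLE}} [v_i Y_i] = \sum_{i=1}^n v_i\, \kappa'\langle \widehat{\boldsymbol{\beta}}^{\rm MLE}, \boldsymbol{x}_i \rangle = \sum_{i=1}^n v_i Y_i.$$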


Proof The first column of the design matrix $X$ is identically equal to 1 representing
the intercept, see (5.4). The second part of Proposition 5.1 then provides for this first
column of $X$, after canceling the (constant) dispersion $\varphi$,

$$(1, \ldots, 1)\, \mathrm{diag}(v_1, \ldots, v_n)\, \kappa'(X \widehat{\boldsymbol{\beta}}^{\rm MLE}) = (1, \ldots, 1)\, \mathrm{diag}(v_1, \ldots, v_n)\, \boldsymbol{Y}.$$

This proves the claim. □

Remark 5.8 We mention once more that this balance property is very strong and
useful, see also Remarks 3.20. In particular, the balance property holds, even though
the chosen GLM might be completely misspecified. Misspecification may include
an incorrect distributional model, a wrong link function choice, or features that have
not been pre-processed appropriately, etc. Such misspecification will imply that
we have a poor model on an insurance policy level (observation level). However,
the total premium charged over the entire portfolio will be on the right level
(provided that the structure of the portfolio does not change) because it matches
the observations, and henceforth, we have unbiasedness for the portfolio mean.
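The balance property under misspecification can be illustrated numerically. The sketch below is a minimal example under stated assumptions: a Poisson GLM with canonical log-link, an intercept and one continuous feature, fitted by Newton-Raphson to toy frequencies that are U-shaped in the feature, so the log-linear regression function is clearly misspecified. The function name and the data are hypothetical.

```python
import math

def fit_poisson_canonical(x, y, v, n_iter=100):
    """Newton-Raphson for a Poisson GLM with log-link, intercept b0, slope b1."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score components sum_i v_i (y_i - mu_i) x_{i,j}, see Proposition 5.1.
        s0 = sum(vi * (yi - mi) for vi, yi, mi in zip(v, y, mu))
        s1 = sum(vi * (yi - mi) * xi for vi, yi, mi, xi in zip(v, y, mu, x))
        # Fisher information X^T diag(v_i mu_i) X, a symmetric 2 x 2 matrix.
        i00 = sum(vi * mi for vi, mi in zip(v, mu))
        i01 = sum(vi * mi * xi for vi, mi, xi in zip(v, mu, x))
        i11 = sum(vi * mi * xi * xi for vi, mi, xi in zip(v, mu, x))
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} s, using the explicit 2 x 2 inverse.
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1

# Toy data: frequencies are U-shaped in x, volumes differ between policies.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.30, 0.12, 0.08, 0.15, 0.35]
v = [1.0, 0.5, 2.0, 1.5, 1.0]

b0, b1 = fit_poisson_canonical(x, y, v)
fitted = [math.exp(b0 + b1 * xi) for xi in x]

fitted_total = sum(vi * mi for vi, mi in zip(v, fitted))
observed_total = sum(vi * yi for vi, yi in zip(v, y))
```

The individual fitted means in `fitted` cannot reproduce the U-shape, but `fitted_total` and `observed_total` agree up to numerical precision, which is exactly the portfolio-level statement of Corollary 5.7.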


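From the log-likelihood function (5.8) we see that under the canonical link choice
we consider the statistic $S(\boldsymbol{Y}) = X^\top \mathrm{diag}(v_i/\varphi)_{1 \le i \le n} \boldsymbol{Y} \in \mathbb{R}^{q+1}$, and to prove the
balance property we have used the first component of this statistic. Considering all
components, $S(\boldsymbol{Y})$ is an unbiased estimator (decision rule) for

$$\mathbb{E}_{\boldsymbol{\beta}}[S(\boldsymbol{Y})] = X^\top \mathrm{diag}(v_i/\varphi)_{1 \le i \le n}\, \kappa'(X\boldsymbol{\beta}) = \left( \sum_{i=1}^n \frac{v_i}{\varphi}\, \kappa'\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle\, x_{i,j} \right)^\top_{0 \le j \le q}. \tag{5.15}$$

This unbiased estimator $S(\boldsymbol{Y})$ meets the Cramér–Rao information bound, hence
it is UMVU: taking the partial derivatives of the previous expression gives
$\nabla_{\boldsymbol{\beta}} \mathbb{E}_{\boldsymbol{\beta}}[S(\boldsymbol{Y})] = \mathcal{I}(\boldsymbol{\beta})$, the latter also being the multivariate Cramér–Rao
information bound for the unbiased decision rule $S(\boldsymbol{Y})$ for (5.15). Focusing on
the first component we have

$$\mathrm{Var}_{\boldsymbol{\beta}}\left( \sum_{i=1}^n \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\rm MLE}}[v_i Y_i] \right) = \mathrm{Var}_{\boldsymbol{\beta}}\left( \sum_{i=1}^n v_i Y_i \right) = \sum_{i=1}^n \varphi\, v_i\, V(\mu_i) = \varphi^2\, (\mathcal{I}(\boldsymbol{\beta}))_{0,0}, \tag{5.16}$$

where the component $(0, 0)$ in the last expression is the top-left entry of Fisher’s
information matrix $\mathcal{I}(\boldsymbol{\beta})$ under the canonical link choice.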


5.1.6 Asymptotic Normality 


Formula (5.16) quantifies the uncertainty in the premium calculation of the insurance
policies if we use the MLE estimated model (under the canonical link
choice). That is, this quantifies the uncertainty in the dual mean parametrization
in terms of the resulting variance. We could also focus on the MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ itself
(for general link function $g$). In general, this MLE is not unbiased but we have
consistency and asymptotic normality similar to Theorem 3.28. Under “certain
regularity conditions”¹ we have for $n$ large

$$\widehat{\boldsymbol{\beta}}_n^{\rm MLE} ~\overset{(d)}{\approx}~ \mathcal{N}\big( \boldsymbol{\beta},\, \mathcal{I}_n(\boldsymbol{\beta})^{-1} \big), \tag{5.17}$$

where $\widehat{\boldsymbol{\beta}}_n^{\rm MLE}$ is the MLE based on the observations $\boldsymbol{Y}_n = (Y_1, \ldots, Y_n)^\top$, and $\mathcal{I}_n(\boldsymbol{\beta})$
is Fisher’s information matrix of $\boldsymbol{Y}_n$, which scales linearly in $n$ in the homogeneous
EF case, see Remarks 3.14, and in the homogeneous EDF case it scales as $\sum_{i=1}^n v_i$,
see (3.25).


5.1.7 Maximum Likelihood Estimation and Unit Deviances 


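From formula (5.7) we conclude that the MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ of $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$ is found by the
solution of (subject to existence)

$$\widehat{\boldsymbol{\beta}}^{\rm MLE} = \underset{\boldsymbol{\beta}}{\arg\max}\; \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \underset{\boldsymbol{\beta}}{\arg\max}\; \sum_{i=1}^n \frac{v_i}{\varphi} \Big[ Y_i\, h(\mu(\boldsymbol{x}_i)) - \kappa\big(h(\mu(\boldsymbol{x}_i))\big) \Big],$$

with $\mu_i = \mu(\boldsymbol{x}_i) = \mathbb{E}_{\theta(\boldsymbol{x}_i)}[Y] = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$ under the link choice $g$. If we prefer
to work with an objective function that reflects the notion of a loss function, we
can work under the unit deviances $\mathfrak{d}(Y_i, \mu_i)$ studied in Sect. 4.1.2. The MLE is then
obtained by, see (4.20)-(4.21),

$$\widehat{\boldsymbol{\beta}}^{\rm MLE} = \underset{\boldsymbol{\beta}}{\arg\max}\; \ell_{\boldsymbol{Y}}(\boldsymbol{\beta}) = \underset{\boldsymbol{\beta}}{\arg\min}\; \sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i), \tag{5.18}$$

the latter satisfying $\mathfrak{d}(Y_i, \mu_i) \ge 0$ for all $1 \le i \le n$, and being zero if and
only if $Y_i = \mu_i$, see Lemma 2.22. Thus, using the unit deviances we have a loss
function that is bounded below by zero, and we determine the regression parameter
$\boldsymbol{\beta}$ such that this loss is (in-sample) minimized. This can also be interpreted in a more
geometric way. Consider the $(q+1)$-dimensional manifold $\mathfrak{M} \subset \mathbb{R}^n$ spanned by
the GLM function

$$\boldsymbol{\beta} ~\mapsto~ \boldsymbol{\mu}(\boldsymbol{\beta}) = g^{-1}(X\boldsymbol{\beta}) = \big( g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_1 \rangle, \ldots, g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x}_n \rangle \big)^\top \in \mathbb{R}^n. \tag{5.19}$$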


¹ The regularity conditions for asymptotic normality results will depend on the particular
regression problem studied, we refer to pages 43–44 in Fahrmeir–Tutz [123].

Fig. 5.2 2-dimensional manifold $\mathfrak{M} \subset \mathbb{R}^3$ for observation $\boldsymbol{Y} = (Y_1, Y_2, Y_3)^\top \in \mathbb{R}^3$;
the straight line illustrates the projection (w.r.t. the unit deviance distances $\mathfrak{d}$) of $\boldsymbol{Y}$
onto $\mathfrak{M}$, which gives the MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE}$

Minimization (5.18) then tries to find the point $\boldsymbol{\mu}(\boldsymbol{\beta})$ in this manifold $\mathfrak{M} \subset \mathbb{R}^n$
that minimizes simultaneously all unit deviances $\mathfrak{d}(Y_i, \cdot)$ w.r.t. the observation $\boldsymbol{Y} =
(Y_1, \ldots, Y_n)^\top \in \mathbb{R}^n$. Or in other words, the optimal parameter $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ is obtained by
“projecting” observation $\boldsymbol{Y}$ onto this manifold $\mathfrak{M}$, where “projection” is understood
as a simultaneous minimization of loss function $\sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i)$, see Fig. 5.2. In
the un-weighted Gaussian case, this corresponds to the usual orthogonal projection
as the next example shows, and in the non-Gaussian case it is understood in the KL
divergence minimization sense as displayed in formula (4.11).
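To make the deviance loss in (5.18) concrete, the following sketch evaluates it for the Poisson model, whose unit deviance is $\mathfrak{d}(y, \mu) = 2\,[\mu - y - y \log(\mu/y)]$ for $y > 0$ and $\mathfrak{d}(0, \mu) = 2\mu$ (with $\varphi = 1$). This formula is not restated in the present section and is brought in here as an assumption; the function names and the toy data are hypothetical. In the homogeneous case (intercept only) the loss is minimized by the volume-weighted mean, which the code checks against two nearby candidate values.

```python
import math

def poisson_unit_deviance(y, mu):
    """Poisson unit deviance d(y, mu) = 2 [mu - y - y log(mu / y)], d(0, mu) = 2 mu."""
    if y == 0.0:
        return 2.0 * mu
    return 2.0 * (mu - y - y * math.log(mu / y))

def deviance_loss(y, v, mu):
    """Volume-weighted deviance loss sum_i v_i d(y_i, mu_i) with dispersion 1."""
    return sum(vi * poisson_unit_deviance(yi, mi) for yi, vi, mi in zip(y, v, mu))

# Toy frequencies and volumes (illustrative only).
y = [0.0, 0.5, 0.2, 1.0]
v = [2.0, 1.0, 4.0, 0.5]

# Homogeneous model: one common mean for all observations.
weighted_mean = sum(vi * yi for vi, yi in zip(v, y)) / sum(v)
loss_at_mean = deviance_loss(y, v, [weighted_mean] * len(y))
loss_below = deviance_loss(y, v, [0.9 * weighted_mean] * len(y))
loss_above = deviance_loss(y, v, [1.1 * weighted_mean] * len(y))

# The unit deviance vanishes exactly when observation and mean coincide.
zero_deviance = poisson_unit_deviance(2.0, 2.0)
```

Both `loss_below` and `loss_above` exceed `loss_at_mean`, and `zero_deviance` is zero, in line with the two properties of the unit deviance used in the text.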


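Example 5.9 (Gaussian Case) Assume we have the Gaussian EDF case $\kappa(\theta) =
\theta^2/2$ with canonical link $g(\mu) = h(\mu) = \mu$. In this case, the manifold (5.19) is the
linear space spanned by the columns of the design matrix $X$

$$\boldsymbol{\beta} ~\mapsto~ \boldsymbol{\mu}(\boldsymbol{\beta}) = X\boldsymbol{\beta} = \big( \langle \boldsymbol{\beta}, \boldsymbol{x}_1 \rangle, \ldots, \langle \boldsymbol{\beta}, \boldsymbol{x}_n \rangle \big)^\top \in \mathbb{R}^n.$$

If additionally we assume $v_i/\varphi = c > 0$ for all $1 \le i \le n$, the minimization
problem (5.18) reads as

$$\widehat{\boldsymbol{\beta}}^{\rm MLE} = \underset{\boldsymbol{\beta}}{\arg\min}\; \sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i) = \underset{\boldsymbol{\beta}}{\arg\min}\; \| \boldsymbol{Y} - X\boldsymbol{\beta} \|_2^2,$$

where we have used that the unit deviances in the Gaussian case are given by the
square loss function, see Example 4.12. As a consequence, the MLE $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ is found
by orthogonally projecting $\boldsymbol{Y}$ onto $\mathfrak{M} = \{ X\boldsymbol{\beta} \mid \boldsymbol{\beta} \in \mathbb{R}^{q+1} \} \subset \mathbb{R}^n$, and this orthogonal
projection is given by $X \widehat{\boldsymbol{\beta}}^{\rm MLE} \in \mathfrak{M}$. ■


5.2 Actuarial Applications of Generalized Linear Models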


The purpose of this section is to illustrate how the concept of GLMs is used in 
actuarial modeling. We therefore explore the typical actuarial examples of claim 
counts and claim size modeling. 


5.2.1 Selection of a Generalized Linear Model 


The selection of a predictive model within GLMs for solving an applied actuarial 
problem requires the following choices. 


Choice of the Member of the EDF Select a member of the EDF that fits the 
modeling problem. In a first step, we should try to understand the properties of 
the data Y before doing this selection, for instance, do we have count data, do we 
have a classification problem, do we have continuous observations? 

All members of the EDF are light-tailed because the moment generating function 
exists around the origin, see Corollary 2.14, and the EDF is not suited to model 
heavy-tailed data, for instance, having a regularly varying tail. Therefore, a datum 
Y is sometimes first transformed before being modeled by a member of the EDF. 
A popular transformation is the logarithm for positive observations. After this 
transformation a member of the EDF can be chosen to model log(Y). For instance, 
if we choose the Gaussian distribution for log(Y), then Y will be log-normally 
distributed, or if we choose the exponential distribution for log(Y), then Y will 
be Pareto distributed, see Sect. 2.2.5. One can then model the transformed datum 
with a GLM. Often this provides very accurate models, say, on the log scale for the 
log-transformed data. There is one issue with this approach, namely, if a model 
is unbiased on the transformed scale then it is typically biased on the original 
observation scale; if the transformation is concave this easily follows from Jensen’s 
inequality. The problematic part now is that the bias correction itself often has 
systematic effects which means that the transformation (or the involved nuisance 
parameters) should be modeled with a regression model, too, see Sect. 5.3.9. In 
many cases this will not easily work, unfortunately. Therefore, if possible, clear 
preference should be given to modeling the data on the original observation scale (if 
unbiasedness is a central requirement). 


Choice of Link Function From a statistical point of view we should choose the
canonical link $g = h$ to connect the mean $\mu$ of the model to the linear predictor
$\eta$ because this implies many favorable mathematical properties. However, as seen,
sometimes we have different needs. Practical reasons may require that we have a
model with additive or multiplicative effects, which favors the identity or the log-link,
respectively. Another requirement is that the resulting canonical parameter $\theta =
(h \circ g^{-1})(\eta)$ needs to be within the effective domain $\Theta$. If this effective domain is
bounded, for instance, if it covers the negative real line as for the gamma model,
a (transformation of the) log-link might be more suitable than the canonical link
because $g^{-1}(\cdot) = -\exp(-\cdot)$ has a strictly negative range, see Example 5.5.


Choice of Features and Feature Engineering Assume we have selected the
member of the EDF and the link function $g$. This gives us the relationship between
the mean $\mu$ and the linear predictor $\eta$, see (5.5),

$$\boldsymbol{x} ~\mapsto~ \mu(\boldsymbol{x}) = \mathbb{E}_{\theta(\boldsymbol{x})}[Y] = g^{-1}(\eta(\boldsymbol{x})) = g^{-1}\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle. \tag{5.20}$$

Thus, the features $\boldsymbol{x} \in \mathcal{X} \subset \mathbb{R}^{q+1}$ need to be in the right functional form so that
they can appropriately describe the systematic effect via the function (5.20). We
distinguish the following feature types:


• Continuous real-valued feature components, examples are age of policyholder,
  weight of car, body mass index, etc.

• Ordinal categorical feature components, examples are ratings like
  good-medium-bad or A-B-C-D-E.

• Nominal categorical feature components, examples are vehicle brands, occupation
  of policyholders, provinces of living places of policyholders, etc. The values
  that the categorical feature components can take are called levels.

• Binary feature components are special categorical features that only have two
  levels, e.g. female-male, open-closed. Because binary variables often play a
  distinguished role in modeling they are separated from categorical variables
  which are typically assumed to have more than two levels.


All these components need to be brought into a suitable form so that they can be
used in a linear predictor $\eta(\boldsymbol{x}) = \langle \boldsymbol{\beta}, \boldsymbol{x} \rangle$, see (5.20). This requires the consideration
of the following points: (1) transformation of continuous components so that they can
describe the systematic effects in a linear form, (2) transformation of categorical
components to real-valued components, (3) interaction of components beyond an
additive structure in the linear predictor, and (4) the resulting design matrix $X$ should
have full rank $q + 1 \le n$. We are going to describe these points (1)–(4) in the next
section.


5.2.2 Feature Engineering 
Categorical Feature Components: Dummy Coding 


Categorical variables need to be embedded into a Euclidean space. This embedding
needs to be done such that the resulting design matrix $X$ has full rank $q + 1 \le n$.
There are many different ways to do so, and the particular choice depends on
the modeling purpose. The most popular way is dummy coding. We only describe
dummy coding here because it is sufficient for our purposes, but we mention that
there are also other codings like effects coding or Helmert’s contrast coding.² The
choice of the coding will not influence the predictive model (if we work with
a full rank design matrix), but it may influence parameter selection, parameter
reduction and model interpretation. For instance, the choice of the coding is (more)
important in medical studies where one tries to understand the effects between
certain therapies.

Table 5.1 Dummy coding example that maps the K = 11 levels (colors) to the
unit vectors of the 10-dimensional Euclidean space $\mathbb{R}^{10}$, selecting the last
level $a_{11}$ (brown color) as reference level, and showing the resulting dummy
vectors $\boldsymbol{x}_j$ as row vectors

  a_1  = white      1  0  0  0  0  0  0  0  0  0
  a_2  = yellow     0  1  0  0  0  0  0  0  0  0
  a_3  = orange     0  0  1  0  0  0  0  0  0  0
  a_4  = red        0  0  0  1  0  0  0  0  0  0
  a_5  = magenta    0  0  0  0  1  0  0  0  0  0
  a_6  = violet     0  0  0  0  0  1  0  0  0  0
  a_7  = blue       0  0  0  0  0  0  1  0  0  0
  a_8  = cyan       0  0  0  0  0  0  0  1  0  0
  a_9  = green      0  0  0  0  0  0  0  0  1  0
  a_10 = beige      0  0  0  0  0  0  0  0  0  1
  a_11 = brown      0  0  0  0  0  0  0  0  0  0

Assume that the raw feature component $\widetilde{x}_j$ is a categorical variable taking $K$
different levels $\{a_1, \ldots, a_K\}$. For dummy coding we declare one level, say $a_K$, to
be the reference level and all other levels are described relative to that reference
level. Formally, this can be described by an embedding map

$$\widetilde{x}_j ~\mapsto~ \boldsymbol{x}_j = \big( \mathbf{1}_{\{\widetilde{x}_j = a_1\}}, \ldots, \mathbf{1}_{\{\widetilde{x}_j = a_{K-1}\}} \big)^\top \in \mathbb{R}^{K-1}. \tag{5.21}$$

This is closely related to the categorical distribution in Sect. 2.1.4. An explicit
example is given in Table 5.1.
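The embedding map (5.21) is straightforward to write down in code. The sketch below is only an illustration in plain Python; the function name `dummy_code` is hypothetical, and the last entry of the level list is taken as the reference level, as in Table 5.1.

```python
def dummy_code(value, levels):
    """Map a categorical value to the (K-1)-dimensional dummy vector of (5.21).

    The last entry of `levels` serves as reference level and is mapped to
    the zero vector.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[:-1]]

# The K = 11 colors of Table 5.1, with brown as reference level.
colors = ["white", "yellow", "orange", "red", "magenta", "violet",
          "blue", "cyan", "green", "beige", "brown"]

white_vector = dummy_code("white", colors)
cyan_vector = dummy_code("cyan", colors)
brown_vector = dummy_code("brown", colors)
```

Each non-reference level is sent to a unit vector of $\mathbb{R}^{10}$ and the reference level to the origin, so a column of ones for the intercept is not a linear combination of the dummy columns.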


Example 5.10 (Multiplicative Model) If we choose the log-link function $\eta =
g(\mu) = \log(\mu)$, we receive the regression function for the categorical example of
Table 5.1

$$\widetilde{x}_j ~\mapsto~ \exp\langle \boldsymbol{\beta}, \boldsymbol{x}_j \rangle = \exp\{\beta_0\} \prod_{k=1}^{K-1} \exp\big\{ \beta_k \mathbf{1}_{\{\widetilde{x}_j = a_k\}} \big\}, \tag{5.22}$$

including an intercept component. Thus, the base value $\exp\{\beta_0\}$ is determined
by the reference level $a_{11}$ = brown, and any color different from brown has
a deviation from the base value described by the multiplicative correction term
$\exp\{ \beta_k \mathbf{1}_{\{\widetilde{x}_j = a_k\}} \}$. ■


² There is an example of Helmert’s contrast coding in Remarks 2.7 of lecture notes [392], and for
more examples we refer to the UCLA statistical consulting website:
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.

Remarks 5.11

• Importantly, dummy coding leads to full rank design matrices $X$ and, henceforth,
  Assumption 5.3 is fulfilled.

• Dummy coding is different from one-hot encoding which is going to be
  introduced in Sect. 7.3.1, below.

• Dummy coding needs some care if we have categorical feature components with
  many levels, for instance, considering car brands and car models we can get
  hundreds of levels. In that case we will have sparsity in the resulting design
  matrix. This may cause computational issues, and, as the following example
  will show, it may lead to high uncertainty in parameter estimation. In particular,
  the columns of the design matrix $X$ of very rare levels will be almost collinear
  which implies that we do not receive very well-conditioned matrices in Fisher’s
  scoring method (5.12). For this reason, it is recommended to merge levels
  to bigger classes. In Sect. 7.3.1, below, we are going to present a different
  treatment. Categorical variables are embedded into low-dimensional spaces, so
  that proximity in these spaces has a reasonable meaning for the regression task
  at hand.


Example 5.12 (Balance Property and Dummy Coding) A main argument for the
use of the canonical link function has been the fulfillment of the balance property,
see Corollary 5.7. If we have categorical feature components and if we apply dummy
coding to those, then the balance property is projected down to the individual levels
of that categorical variable. Assume that columns 2 to $K$ of design matrix $X$ are
used to model a raw categorical feature $\widetilde{x}_1$ with $K$ levels according to (5.21). In that
case, columns $2 \le k \le K$ will indicate all observations $Y_i$ which belong to levels
$a_{k-1}$. Analogously to the proof of Corollary 5.7, we receive (summation $i$ runs over
the different instances/policies)

$$\sum_{i:\, \widetilde{x}_{i,1} = a_{k-1}} \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\rm MLE}}[v_i Y_i] = \sum_{i=1}^n x_{i,k-1}\, \mathbb{E}_{\widehat{\boldsymbol{\beta}}^{\rm MLE}}[v_i Y_i] = \sum_{i=1}^n x_{i,k-1}\, v_i Y_i = \sum_{i:\, \widetilde{x}_{i,1} = a_{k-1}} v_i Y_i. \tag{5.23}$$

Thus, we receive the balance property for all policies $1 \le i \le n$ that belong to level
$a_{k-1}$.

If we have many levels, then it will happen that some levels have only very few
observations, and the above summation (5.23) only runs over very few insurance
policies with $\widetilde{x}_{i,1} = a_{k-1}$. Suppose additionally the volumes $v_i$ are small. This can
lead to considerable estimation uncertainty, because the estimated prices on the
left-hand side of (5.23) will be based too much on individual observations $Y_i$ having the
corresponding level, and we are not in the regime of a law of large numbers that
balances these observations.

Thus, this balance property from dummy coding is a natural property under the
canonical link choice. Actuarial pricing is very familiar with such a property. Early
distribution-free approaches have postulated this property resulting in the method of
the total marginal sums, see Bailey and Jung [22, 206], where the balance property
is enforced for marginal sums of all categorical levels in parameter estimation.
However, if we have scarce levels in categorical variables, this approach needs
careful consideration. ■
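The effect of a scarce level can be made explicit in the simplest setting. Assume a GLM with canonical link, an intercept and a single dummy-coded categorical feature. Then (5.23) for the non-reference levels, together with the portfolio balance of Corollary 5.7, forces the fitted mean of every level to equal the volume-weighted average of the observations on that level (provided this average lies in the mean parameter space). The sketch below computes these averages; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def level_means(levels, y, v):
    """Volume-weighted mean sum(v_i y_i) / sum(v_i) for each categorical level."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for level, yi, vi in zip(levels, y, v):
        numerator[level] += vi * yi
        denominator[level] += vi
    return {level: numerator[level] / denominator[level] for level in numerator}

# Level "A" has three policies with full volume; level "B" has a single
# policy with a small volume and one unusually high observation.
levels = ["A", "A", "A", "B"]
y = [0.1, 0.0, 0.2, 1.0]
v = [1.0, 1.0, 1.0, 0.1]

means = level_means(levels, y, v)
```

The estimate for level "B" coincides with its single observation, so the fitted price of that level inherits the full noise of one policy, which is the estimation uncertainty described in the example.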


Binary Feature Components 


Binary feature components do not need a treatment different from the categorical 
ones, they are Bernoulli variables which can be encoded as 0 or 1. This is exactly 
dummy coding for K = 2 levels. 


Continuous Feature Components 


Continuous feature components are already real-valued. Therefore, from the view- 
point of ‘variable types’, the continuous feature components do not need any 
pre-processing because they are already in the right format to be included in scalar 
products. 

Nevertheless, in many cases, also continuous feature components need feature 
engineering because only in rare cases they directly fit the functional form (5.20). 
We give an example. Consider car drivers that have different driving experience and 
different driving skills. To explain experience and skills we typically choose the age 
of driver as explanatory variable. Modeling the claim frequency as a function of the 
age of driver, we often observe a U-shaped function, thus, a function that is non- 
monotone in the age of driver variable. Since the link function g needs to be strictly 
monotone, this regression problem cannot be modeled by (5.20), directly including 
the age of driver as a feature because this leads to monotonicity of the regression 
function in the age of driver variable. 

Typically, in such situations, the continuous variable is discretized to categorical 
classes. In the driver’s age example, we build age classes. These age classes 
are then treated as categorical variables using dummy coding (5.21). We will 
give examples below. These age classes should fulfill the requirement of being 
sufficiently homogeneous in the sense that insurance policies that fall into the 
same class should have a similar propensity to claims. This implies that we would 
like to have many small homogeneous classes. However, the classes should be 
sufficiently large, otherwise parameter estimation involves high uncertainty, see 
also Example 5.12. Thus, there is a trade-off to sufficiently meet both of these two 
requirements. 

A disadvantage of this discretization approach is that neighboring age classes
will not be recognized by the regression function because, per se, dummy coding
is based on nominal variables not having any topology. This is also illustrated by
the fact that all categorical levels (excluding the reference level) have, in view
of embedding (5.21), the same mutual Euclidean distance. Therefore, in some
applications, one prefers a different approach by rather trying to find an appropriate
functional form. For instance, we can pre-process a strictly positive raw feature
component $\widetilde{x}_l$ to a higher-dimensional functional form

$$\widetilde{x}_l ~\mapsto~ \beta_1 \widetilde{x}_l + \beta_2 \widetilde{x}_l^2 + \beta_3 \widetilde{x}_l^3 + \beta_4 \log(\widetilde{x}_l), \tag{5.24}$$

with regression parameter $(\beta_1, \ldots, \beta_4)^\top$, i.e., we have a polynomial function of
degree 3 plus a logarithmic term in this choice. If one does not want to choose
a specific functional form, one often chooses natural cubic splines. This, together
with regularization, leads to the framework of generalized additive models (GAMs),
which is a popular family of regression models besides GLMs; for literature on GAMs
we refer to Hastie–Tibshirani [182], Wood [384], Ohlsson–Johansson [290], Denuit
et al. [99] and Wüthrich–Buser [392]. In these notes we will not further pursue
GAMs.
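The functional form (5.24) amounts to replacing one raw feature by four engineered components that then enter the linear predictor. A minimal sketch, with hypothetical function names and illustrative parameter values, could look as follows.

```python
import math

def cubic_log_features(x_raw):
    """Pre-process a strictly positive raw feature into (x, x^2, x^3, log x)."""
    if x_raw <= 0.0:
        raise ValueError("the logarithmic term requires a strictly positive feature")
    return [x_raw, x_raw ** 2, x_raw ** 3, math.log(x_raw)]

def linear_predictor_part(x_raw, beta):
    """Contribution beta_1 x + beta_2 x^2 + beta_3 x^3 + beta_4 log(x) of (5.24)."""
    return sum(b * f for b, f in zip(beta, cubic_log_features(x_raw)))

features_at_two = cubic_log_features(2.0)
contribution_at_one = linear_predictor_part(1.0, [1.0, 1.0, 1.0, 1.0])
```

The model stays linear in the parameters $(\beta_1, \ldots, \beta_4)$, so it can still be fitted with the GLM machinery described above; note that powers of the same feature can be strongly correlated, which affects the conditioning of the design matrix.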


Example 5.13 (Multiplicative Model) If we choose the log-link function $\eta =
g(\mu) = \log(\mu)$ we receive a multiplicative regression function

$$\boldsymbol{x} ~\mapsto~ \mu(\boldsymbol{x}) = \exp\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \exp\{\beta_0\} \prod_{j=1}^q \exp\{\beta_j x_j\}.$$

That is, all feature components $x_j$ enter the regression function in an exponential
form. In general insurance, one may have specific variables for which it is explicitly
known that they should enter the regression function as a power function. Having a
raw feature $\widetilde{x}_l$ we can pre-process it as $\widetilde{x}_l \mapsto x_l = \log(\widetilde{x}_l)$. This implies

$$\mu(\boldsymbol{x}) = \exp\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle = \exp\{\beta_0\}\; \widetilde{x}_l^{\,\beta_l} \prod_{j=1,\, j \ne l}^q \exp\{\beta_j x_j\},$$

which gives a power term of order $\beta_l$. The GLM estimates in this case the power
parameter that should be used for $\widetilde{x}_l$. If the power parameter is known, then one
can even include this component as an offset; offsets are discussed in Sect. 5.2.3,
below. ■


Interactions 


Naturally, GLMs only allow for an additive structure in the linear predictor. Similar
to continuous feature components, such an additive structure may not always be
suitable and one wants to model more complex interaction terms. Such interactions
need to be added manually by the modeler, for instance, if we have two raw feature
components $\widetilde{x}_l$ and $\widetilde{x}_k$, we may want to consider a functional form

$$(\widetilde{x}_l, \widetilde{x}_k) ~\mapsto~ \beta_1 \widetilde{x}_l + \beta_2 \widetilde{x}_k + \beta_3 \widetilde{x}_l \widetilde{x}_k + \beta_4 \widetilde{x}_l^2 \widetilde{x}_k,$$

with regression parameter $(\beta_1, \ldots, \beta_4)^\top$.

More generally, this manual feature engineering of adding interactions and of
specifying functional forms (5.24) can be understood as a new representation of raw
features. Representation learning in relation to deep learning is going to be discussed
in Sect. 7.1, and this discussion is also related to Mercer’s kernels.


5.2.3 Offsets 


In many heterogeneous portfolio problems with observations $\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top$,
there are known prior differences between the individual risks $Y_i$, for instance, the
time exposure varies between the different policies $i$. Such known prior differences
can be integrated into the predictors, and this integration typically does not involve
any additional model parameters. A simple way is to use an offset (constant) in
the linear predictor of a GLM. Assume that each observation $Y_i$ is equipped with a
feature $\boldsymbol{x}_i \in \mathcal{X}$ and a known offset $o_i \in \mathbb{R}$ such that the linear predictor $\eta_i$ takes the
form

$$(\boldsymbol{x}_i, o_i) ~\mapsto~ g(\mu_i) = \eta_i = \eta(\boldsymbol{x}_i, o_i) = o_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle, \tag{5.25}$$

for all $1 \le i \le n$. An offset $o_i$ does not change anything from a structural viewpoint,
in fact, it could be integrated into the feature $\boldsymbol{x}_i$ with a regression parameter that is
identically equal to 1.

Offsets are frequently used in Poisson models with the (canonical) log-link
choice to model multiplicative time exposures in claim frequency modeling. Under
the log-link choice we receive from (5.25) the following mean function

$$(\boldsymbol{x}_i, o_i) ~\mapsto~ \mu(\boldsymbol{x}_i, o_i) = \exp\{\eta(\boldsymbol{x}_i, o_i)\} = \exp\{o_i + \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle\} = \exp\{o_i\} \exp\langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle.$$

In this version, the offset $o_i$ provides us with an exposure $\exp\{o_i\}$ that acts
multiplicatively on the regression function. If $w_i = \exp\{o_i\}$ measures time, then
$w_i$ is a so-called pro-rata temporis (proportional in time) exposure.


Remark 5.14 (Boosting) A popular machine learning technique in statistical modeling
is boosting. Boosting tries to step-wise adaptively improve a regression
model. Offsets (5.25) are a simple way of constructing boosted models. Assume
we have constructed a predictive model using any statistical model, and denote the
resulting estimated means of $Y_i$ by $\widehat{\mu}_i^{(0)}$. The idea of boosting is that we select
another statistical model and we try to see whether this second model can still find
systematic structure in the data which has not been found by the first model. In view
of (5.25), we include the first model into the offset and we build a second model
around this offset, that is, we may explore a GLM

$$
\widehat{\mu}_i^{(1)} = g^{-1}\left( g\big(\widehat{\mu}_i^{(0)}\big) + \langle \beta, x_i \rangle \right).
$$

If the first model is perfect we come up with a regression parameter $\beta = 0$,
otherwise the linear predictor $\langle \beta, x_i \rangle$ of the second model starts to compensate
for weaknesses in $\widehat{\mu}_i^{(0)}$. Of course, this boosting procedure can then be iterated
and one should stop boosting before the resulting model starts to over-fit to the
data. Typically, this approach is applied to regression trees instead of GLMs, see
Ferrario–Hämmerli [125], Section 7.4 in Wüthrich–Buser [392], Lee–Lin [241] and
Denuit et al. [100].
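
One boosting step of this kind can be sketched in R as follows. This is a minimal illustration on simulated data, not the procedure of the cited references: the variables x1, x2, v and the true parameter values are assumptions made for this sketch, and both models are Poisson GLMs with log-link, so that $g(\widehat{\mu}_i^{(0)}) = \log \widehat{\mu}_i^{(0)}$ enters the second model as offset.

```r
set.seed(1)
n  <- 5000
x1 <- runif(n)
x2 <- runif(n)
v  <- runif(n, 0.2, 1)                          # simulated exposures
N  <- rpois(n, v * exp(-2 + 0.8 * x1 + 0.5 * x2))

# First model: deliberately ignores x2
fit0 <- glm(N ~ x1, family = poisson(), offset = log(v))
mu0  <- fitted(fit0)                            # estimated means of N_i (exposure included)

# Second model: the first model enters through the offset g(mu0) = log(mu0)
fit1 <- glm(N ~ x2, family = poisson(), offset = log(mu0))

print(coef(fit1))                               # non-zero slope: x2 carries remaining structure
print(c(first = deviance(fit0), boosted = deviance(fit1)))
```

Since the choice $\beta = 0$ in the second model reproduces the first model, the in-sample deviance of the boosted model cannot exceed the one of the first model; whether the improvement generalizes has to be checked out-of-sample.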


5.2.4 Lab: Poisson GLM for Car Insurance Frequencies 


We present a first GLM example. This example is based on French motor third 
party liability (MTPL) insurance claim counts data. The data is described in detail 
in Chap. 13.1; an excerpt of the available MTPL data is given in Listing 13.2. For the 
moment we only consider claim frequency modeling. We use the following data: $N_i$
describes the number of claims, $v_i \in (0, 1]$ describes the duration of the insurance
policy, and $\tilde{x}_i$ describes the available raw feature information of insurance policy $i$,
see Listing 13.2.

We are going to model the claim counts $N_i$ with a Poisson GLM using the
canonical link function of the Poisson model. In the Poisson approach there are two
different ways to account for the duration of the insurance policy. Either we model
$Y_i = N_i / v_i$ with the Poisson model of the EDF, see Sect. 2.2.2 and Remarks 2.13
(reproductive form), or we directly model $N_i$ with the Poisson distribution from the
EF and treat the log-duration as an offset variable $o_i = \log v_i$. In the first approach
we have for the log-link choice $g(\cdot) = h(\cdot) = \log(\cdot)$ and dispersion $\varphi = 1$

$$
Y_i = N_i / v_i ~\sim~ f(y_i; \theta_i, v_i) = \exp\left\{ \frac{y_i \langle \beta, x_i \rangle - e^{\langle \beta, x_i \rangle}}{1/v_i} + a(y_i; v_i) \right\}, \tag{5.26}
$$

where $x_i \in \mathcal{X}$ is the suitably pre-processed feature information of insurance policy
$i$, and with canonical parameter $\theta_i = \eta(x_i) = \langle \beta, x_i \rangle$. In the second approach we
include the log-duration as offset into the regression function and model $N_i$ with
the Poisson distribution from the EF. Using notation (2.2) this gives us

$$
\begin{aligned}
N_i ~\sim~ f(n_i; \theta_i) &= \exp\left\{ n_i \left(\log v_i + \langle \beta, x_i \rangle\right) - e^{\log v_i + \langle \beta, x_i \rangle} + a(n_i) \right\} \\
&= \exp\left\{ \frac{y_i \langle \beta, x_i \rangle - e^{\langle \beta, x_i \rangle}}{1/v_i} + a(n_i) + n_i \log v_i \right\},
\end{aligned} \tag{5.27}
$$

with canonical parameter $\theta_i = \eta(x_i, o_i) = o_i + \langle \beta, x_i \rangle = \log v_i + \langle \beta, x_i \rangle$ for
observation $n_i = v_i y_i$. That is, we receive the same model in both cases (5.26)
and (5.27) under the canonical log-link choice for the Poisson GLM.
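
The equivalence of the two versions can be checked numerically. The following R sketch uses simulated data (the variables x, v and the true parameters are assumptions for this illustration): version (5.27) is fitted with the log-duration as offset, version (5.26) with the frequencies $Y_i = N_i/v_i$ as responses and the durations as weights. The tightened convergence tolerance in glm.control is our choice to make the comparison sharp.

```r
set.seed(2)
n <- 2000
x <- rnorm(n)
v <- runif(n, 0.1, 1)                       # durations v_i in (0, 1]
N <- rpois(n, v * exp(-1.5 + 0.4 * x))      # claim counts
Y <- N / v                                  # claim frequencies
ctrl <- glm.control(epsilon = 1e-12, maxit = 50)

# Version (5.27): counts N_i with log-duration as offset
fit_offset <- glm(N ~ x, family = poisson(), offset = log(v), control = ctrl)

# Version (5.26): frequencies Y_i with durations as weights
# (R warns about non-integer Poisson responses; the score equations are unaffected)
fit_weight <- suppressWarnings(
  glm(Y ~ x, family = poisson(), weights = v, control = ctrl))

print(rbind(offset = coef(fit_offset), weights = coef(fit_weight)))
```

Both fits solve the same score equations and therefore return the same MLE up to numerical precision; the reported AIC values differ, however, because they are evaluated on different response scales.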

Finally, we make the assumption that all observations $N_i$ are independent. There
remains the pre-processing of the raw features $\tilde{x}_i$ to features $x_i$ so that they can be
used in a sensible way in the linear predictors $\eta_i = \eta(x_i, o_i) = o_i + \langle \beta, x_i \rangle$.


Feature Engineering 
Categorical and Binary Variables: Dummy Coding 


For categorical and binary variables we use dummy coding as described in
Sect. 5.2.2. We have two categorical variables VehBrand and Region, as well
as a binary variable VehGas, see Listing 13.2. We choose the first level as
reference level, and the remaining levels are characterized by $(K-1)$-dimensional
embeddings (5.21). This provides us with $K - 1 = 10$ parameters for VehBrand,
$K - 1 = 21$ parameters for Region and $K - 1 = 1$ parameter for VehGas.
Figure 5.3 shows the empirical marginal frequencies $\bar{\lambda} = \sum N_i / \sum v_i$ on all
levels of the categorical feature components VehBrand, Region and VehGas.
Moreover, the blue areas (in the colored version) give confidence bounds of
$\pm 2 \sqrt{\bar{\lambda} / \sum v_i}$ (under a Poisson assumption), see Example 3.22. The narrower
these confidence bounds, the bigger the volumes $\sum v_i$ behind these empirical
marginal estimates.
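
The marginal frequencies and their confidence bounds can be computed per level along the following lines. The sketch uses a simulated stand-in data frame dat_demo with the column names ClaimNb, Exposure and VehGas of the MTPL data; the simulated values themselves are purely hypothetical.

```r
set.seed(3)
n <- 3000
dat_demo <- data.frame(
  VehGas   = factor(sample(c("Diesel", "Regular"), n, replace = TRUE)),
  Exposure = runif(n, 0.1, 1))
dat_demo$ClaimNb <- rpois(n, 0.08 * dat_demo$Exposure)

claims <- tapply(dat_demo$ClaimNb,  dat_demo$VehGas, sum)   # sum of N_i per level
volume <- tapply(dat_demo$Exposure, dat_demo$VehGas, sum)   # sum of v_i per level
lambda <- claims / volume                                   # empirical marginal frequency
half   <- 2 * sqrt(lambda / volume)                         # half-width under a Poisson assumption

marg <- data.frame(level = names(lambda),
                   freq  = as.numeric(lambda),
                   lower = as.numeric(lambda - half),
                   upper = as.numeric(lambda + half))
print(marg)
```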

Fig. 5.3 Empirical marginal frequencies on each level of the categorical variables (lhs)
VehBrand, (middle) Region, and (rhs) VehGas


Continuous Variables


We consider feature engineering of the continuous variables Area, VehPower,
VehAge, DrivAge, BonusMalus and log-Density (Density on the log
scale); note that we map the Area codes $\{A, \ldots, F\} \mapsto \{1, \ldots, 6\}$. Some of these
variables do not show any monotonicity nor log-linearity in the empirical marginal
frequency plots, see Fig. 5.4.

This non-monotonicity and non-log-linearity suggest in a first step to build
homogeneous classes for these feature components and to use dummy coding for the
resulting classes. We make the following choices here (motivated by the marginal
graphs of Fig. 5.4):


• Area: continuous log-linear feature component for $\{A, \ldots, F\} \mapsto \{1, \ldots, 6\}$;

• VehPower: discretize into categorical classes where we merge vehicle power
groups greater than or equal to 9 (totally $K = 6$ levels);

• VehAge: we build categorical classes $[0, 6)$, $[6, 13)$, $[13, \infty)$ (totally $K = 3$
levels);

• DrivAge: we build categorical classes $[18, 21)$, $[21, 26)$, $[26, 31)$, $[31, 41)$,
$[41, 51)$, $[51, 71)$, $[71, \infty)$ (totally $K = 7$ levels);

• BonusMalus: continuous log-linear feature component (we censor at 150);

• Density: log-density is chosen as continuous log-linear feature component.


This encoding is slightly different from Noll et al. [287] because of different data
cleaning. The discretization has been chosen quite ad hoc by just looking at the
empirical plots; as illustrated in Section 6.1.6 of Wüthrich–Buser [392], regression
trees may provide an algorithmic way of choosing homogeneous classes of sufficient
volume. This provides us with a feature space (the initial component stands for the
intercept $x_{i,0} = 1$ and the order of the terms is the same as in Listing 13.2)

$$
\mathcal{X} \subset \{1\} \times \mathbb{R} \times \{0,1\}^5 \times \{0,1\}^2 \times \{0,1\}^6 \times \mathbb{R} \times \{0,1\}^{10} \times \{0,1\} \times \mathbb{R} \times \{0,1\}^{21},
$$

of dimension $q + 1 = 1+1+5+2+6+1+10+1+1+21 = 49$. The R code [307]
for this pre-processing of continuous variables is shown in Listing 5.1; categorical
variables do not need any special treatment because variables of factor type are
handled internally in R by dummy coding. We call this model Poisson GLM1.


Choice of Learning and Test Samples 


To measure predictive performance we follow the generalization approach as
proposed in Chap. 4. This requires that we partition our entire data into learning
sample $\mathcal{L}$ and test sample $\mathcal{T}$, see Fig. 4.1. Model selection and model fitting will
be done on the learning sample $\mathcal{L}$ only, and the test sample $\mathcal{T}$ is used to analyze
the generalization of the fitted models to unseen data. We partition the data at
random (non-stratified) in a ratio of 9 : 1, and we are going to hold on to the same
partitioning throughout this monograph whenever we study this example. The R
code used is given in Listing 5.2.

Listing 5.1 Pre-processing of features for model Poisson GLM1 in R


# Area codes A,...,F (assumed to be stored as a factor) are mapped to 1,...,6
dat$AreaGLM       <- as.integer(dat$Area)
# vehicle power groups of 9 and above are merged into one class
dat$VehPowerGLM   <- as.factor(pmin(dat$VehPower, 9))
# vehicle age classes [0,6), [6,13), [13,infinity)
dat$VehAgeGLM     <- as.factor(cut(dat$VehAge, c(0,5,12,101),
                               labels = c("0-5","6-12","12+"),
                               include.lowest = TRUE))
# driver age classes [18,21), [21,26), ..., [71,infinity)
dat$DrivAgeGLM    <- as.factor(cut(dat$DrivAge, c(18,20,25,30,40,50,70,101),
                               labels = c("18-20","21-25","26-30","31-40","41-50",
                                          "51-70","71+"), include.lowest = TRUE))
# bonus-malus level censored at 150
dat$BonusMalusGLM <- pmin(dat$BonusMalus, 150)
# density on the log scale
dat$DensityGLM    <- log(dat$Density)


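Table 5.2 shows the summary of the chosen partition into learning and test
samples

$$
\mathcal{L} = \left\{ (Y_i = N_i / v_i,\, x_i,\, v_i):~ i = 1, \ldots, n = 610206 \right\},
$$
and
$$
\mathcal{T} = \left\{ (Y_t^\dagger = N_t^\dagger / v_t^\dagger,\, x_t^\dagger,\, v_t^\dagger):~ t = 1, \ldots, T = 67801 \right\}.
$$

In contrast to Sect. 4.2 we also include feature information and exposure information
in $\mathcal{L}$ and $\mathcal{T}$.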


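Listing 5.2 Partition of the data to learning sample $\mathcal{L}$ and test sample $\mathcal{T}$


RNGversion("3.5.0")   # we use R version 3.5.0 for this partition
set.seed(500)
# draw 90% of the row indices at random (without replacement) for the learning sample
ll <- sample(c(1:nrow(dat)), round(0.9*nrow(dat)), replace = FALSE)

learn <- dat[ll,]     # learning sample
test  <- dat[-ll,]    # test sample: all remaining rows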


Table 5.2 Choice of learning data set $\mathcal{L}$ and test data set $\mathcal{T}$; the empirical frequency on both
data sets is similar (last column), and the split of the policies w.r.t. the numbers of claims is also
rather similar

                                Numbers of observed claims                        Empirical
                        0         1        2        3        4          5         frequency
Learning sample L     96.32%    3.47%    0.19%    0.01%    0.0006%    0.0002%     7.36%
Test sample T         96.31%    3.50%    0.18%    0.01%    0.0015%    0.0015%     7.35%


Maximum-Likelihood Estimation and Results


The remaining step is to perform MLE to estimate regression parameter $\beta \in \mathbb{R}^{q+1}$.
This can be done either by maximizing the Poisson log-likelihood function or by
minimizing the Poisson deviance loss. In view of (4.9) and Example 4.27, the
Poisson deviance loss on the learning data $\mathcal{L}$ is given by

$$
\beta ~\mapsto~ \mathfrak{D}(\mathcal{L}, \beta) = \frac{1}{n} \sum_{i=1}^n 2 v_i \left( \mu(x_i) - Y_i - Y_i \log\left( \frac{\mu(x_i)}{Y_i} \right) \right) \ge 0, \tag{5.28}
$$

where the terms under the summation are set equal to $2 v_i \mu(x_i)$ for $Y_i = 0$, see (4.8),
and we have GLM regression function

$$
x ~\mapsto~ \mu(x) = \mu_\beta(x) = \exp\langle \beta, x \rangle.
$$

That is, we work under the canonical link with the canonical parameter being equal
to the linear predictor. The MLE of $\beta$ is found by minimizing (5.28). This is done
with Fisher's scoring method. In order to receive a non-degenerate solution we need
to ensure that we have sufficiently many claims $Y_i > 0$, otherwise it might happen
that the MLE provides a (degenerate) solution at the boundary of the effective
domain $\Theta$. We denote the MLE by $\widehat{\beta}^{\rm MLE}_{\mathcal{L}} = \widehat{\beta}^{\rm MLE}$, because it has been estimated
on the learning data $\mathcal{L}$ only. This gives us estimated regression function

$$
x ~\mapsto~ \widehat{\mu}(x) = \mu_{\widehat{\beta}^{\rm MLE}_{\mathcal{L}}}(x) = \exp\langle \widehat{\beta}^{\rm MLE}_{\mathcal{L}}, x \rangle.
$$
We emphasize that we only use the learning data $\mathcal{L}$ for this model fitting. In view of
Definition 4.24 we receive in-sample and out-of-sample Poisson deviance losses

$$
\mathfrak{D}(\mathcal{L}, \widehat{\beta}^{\rm MLE}_{\mathcal{L}}) = \frac{1}{n} \sum_{i=1}^n 2 v_i \left( \widehat{\mu}(x_i) - Y_i - Y_i \log\left( \frac{\widehat{\mu}(x_i)}{Y_i} \right) \right) \ge 0,
$$

$$
\mathfrak{D}(\mathcal{T}, \widehat{\beta}^{\rm MLE}_{\mathcal{L}}) = \frac{1}{T} \sum_{t=1}^T 2 v_t^\dagger \left( \widehat{\mu}(x_t^\dagger) - Y_t^\dagger - Y_t^\dagger \log\left( \frac{\widehat{\mu}(x_t^\dagger)}{Y_t^\dagger} \right) \right) \ge 0.
$$

We implement this GLM on the data of Listing 5.1 (and including the categorical
features) in R using the function glm [307]; a short overview of the results is
presented in Listing 5.3. This overview presents the regression model implemented,
an excerpt of the parameter estimates $\widehat{\beta}_k^{\rm MLE}$, and standard errors which are received
from the square-rooted diagonal entries of the inverse of the estimated Fisher's
information matrix $\mathcal{I}_n(\widehat{\beta}^{\rm MLE})$, see (5.17); the remaining columns will be described
in Sect. 5.3.2 on the Wald test (5.33). The bottom line of the output says that Fisher's
scoring algorithm has converged in 6 iterations. The output also gives the in-sample
deviance loss $n \mathfrak{D}(\mathcal{L}, \widehat{\beta}^{\rm MLE}_{\mathcal{L}})$, called Residual deviance (not being scaled by the
number of observations), as well as Akaike's Information Criterion (AIC), see Sect. 4.2.3 for
AIC. Note that we have implemented Poisson version (5.27) with the exposures
entering the offset, see lines 2–4 of Listing 5.3; this is important for understanding
AIC being calculated on the (unscaled) claim counts $N_i$.

Listing 5.3 Results in model Poisson GLM1 using the R command glm


Call:
glm(formula = ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM +
    BonusMalusGLM + VehBrand + VehGas + DensityGLM + Region +
    AreaGLM, family = poisson(), data = learn, offset = log(Exposure))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4728  -0.3256  -0.2456  -0.1383    [...]

Coefficients:
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -4.8175439   0.0579296  -83.162   < 2e-16 ***
VehPowerGLM5  0.0604293   0.0229841    2.629  0.008559 **
VehPowerGLM6  0.0868252   0.0225509    3.850  0.000118 ***
...
RegionR93     0.1388160   0.0294901    4.707  2.51e-06 ***
RegionR94     0.1918538   0.0938250    2.045  0.040874 *
AreaGLM       0.0407973   0.0200818    2.032  0.042199 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 153852  on 610205  degrees of freedom
Residual deviance: 147069  on 610157  degrees of freedom
AIC: 192818

Number of Fisher Scoring iterations: 6


Table 5.3 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses,
tenfold cross-validation losses with empirical standard deviation in brackets, see also (4.36), (units
are in $10^{-2}$) and the in-sample average frequency of the null model (Poisson intercept model, see
Example 4.27) and of model Poisson GLM1

                Run     #                  In-sample    Out-of-sample    Tenfold CV         Aver.
                time    Param.   AIC       loss on L    loss on T        loss D^CV          freq.
Poisson null    -       1        199506    25.213       25.445           25.213 (0.234)     7.36%
Poisson GLM1    16s     49       192818    24.101       24.146           24.121 (0.245)     7.36%
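
The in-sample and out-of-sample Poisson deviance losses can be evaluated with a few lines of R. The sketch below works on simulated stand-in data (the data frame d with columns x, v, N and the helper poisson_deviance are hypothetical names for this illustration, not part of the MTPL example); it implements the averaged loss (5.28) and checks it against the Residual deviance reported by glm.

```r
# Average Poisson deviance loss as in (5.28): y = N / v are observed frequencies,
# mu are predicted frequencies and v are exposures; terms with y = 0 equal 2 * v * mu.
poisson_deviance <- function(y, mu, v) {
  ylog <- ifelse(y > 0, y * log(mu / y), 0)
  mean(2 * v * (mu - y - ylog))
}

set.seed(4)
n <- 6000
d <- data.frame(x = runif(n), v = runif(n, 0.1, 1))     # simulated stand-in data
d$N <- rpois(n, d$v * exp(-2.5 + 0.6 * d$x))

idx   <- sample(seq_len(n), round(0.9 * n))              # random 9 : 1 partition
learn <- d[idx, ]
test  <- d[-idx, ]

fit <- glm(N ~ x + offset(log(v)), family = poisson(), data = learn)

# predict() returns expected counts v * mu(x); dividing by v gives frequencies
mu_learn <- predict(fit, newdata = learn, type = "response") / learn$v
mu_test  <- predict(fit, newdata = test,  type = "response") / test$v

in_sample     <- poisson_deviance(learn$N / learn$v, mu_learn, learn$v)
out_of_sample <- poisson_deviance(test$N / test$v,  mu_test,  test$v)
print(c(in_sample = in_sample, out_of_sample = out_of_sample))
```

Multiplying the in-sample loss by the number of learning observations recovers the unscaled Residual deviance, i.e., nrow(learn) * in_sample agrees with deviance(fit).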

Table 5.3 summarizes the results of model Poisson GLM1 and it compares the
figures to the null model (only having an intercept $\beta_0$); the null model has already
been introduced in Example 4.27. We present the run time needed to fit the model,³
the number of regression parameters $q + 1$ in $\beta \in \mathbb{R}^{q+1}$, AIC, in-sample and
out-of-sample deviance losses, as well as tenfold cross-validation losses on the
learning data $\mathcal{L}$. For tenfold cross-validation we always use the same (non-stratified)
partition of $\mathcal{L}$ (in all examples in this monograph), and in brackets we show the
empirical standard deviation received by (4.36). Tenfold cross-validation would not
be necessary in this case because we have test data $\mathcal{T}$ on which we can evaluate the
out-of-sample deviance GL. We present both figures to back-test whether tenfold
cross-validation works properly in our example. We observe that the out-of-sample
deviance losses $\mathfrak{D}(\mathcal{T}, \widehat{\beta}^{\rm MLE}_{\mathcal{L}})$ are within one empirical standard deviation of the
tenfold cross-validation losses $\mathfrak{D}^{\rm CV}$, which supports this methodology of model
comparison.

³ All run times are measured on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz
1.99 GHz with 16 GB RAM, and they only correspond to fitting the model (or the corresponding
step) once, i.e., they do not account for multiple runs, for instance, for K-fold cross-validation.

From Table 5.3 we conclude that we should prefer model Poisson GLM1 over
the null model; this decision is supported by a smaller AIC, a smaller out-of-sample
deviance loss $\mathfrak{D}(\mathcal{T}, \widehat{\beta}^{\rm MLE}_{\mathcal{L}})$ as well as a smaller cross-validation loss $\mathfrak{D}^{\rm CV}$. The last
column of Table 5.3 confirms that the estimated model meets the balance property
(we work with the canonical link here). Note that this balance property should be
fulfilled for two reasons. Firstly, we would like to have the overall portfolio price on
the right level, and secondly, deviance losses should only be compared on the same
overall frequency, see Example 4.10.

Before we continue to introduce more models to challenge model Poisson 
GLM1, we are going to discuss statistical tools for model evaluation. Of course, 
we would like to know whether model Poisson GLM1 is a good model for this data 
or whether it is just the better model of two bad options. 


Remark 5.15 (Prior and Posterior Information) Pricing literature distinguishes 
between prior feature information and posterior feature information, see Verschuren 
[372]. Prior feature information is available at the inception of the (new) insurance 
contract before having any claims history. This includes, for instance, age of driver, 
vehicle brand, etc. For policy renewals, past claims history is available and prices 
of policy renewals can also be based on such posterior information. Past claims 
history has led to the development of so-called bonus-malus systems (BMS) which 
often are in the form of multiplicative factors to the base premium to reward and 
punish good and bad past experience, respectively. One stream of literature studies 
optimal designs of BMS; we refer to Loimaranta [255], De Pril [91], Lemaire [245],
Denuit et al. [102], Brouhns et al. [57], Pinquet [304], Pinquet et al. [305], Tzougas
et al. [360] or Ágoston–Gyetvai [4]. Another stream of literature studies how one
can optimally extract predictive information from an existing BMS, see
Boucher–Inoussa [46], Boucher–Pigeon [47] and Verschuren [372].

The latter is basically what we also do in the above example: note that we include
the variable BonusMalus into the feature information and, thus, we use past
claims information to predict future claims. For new policies, the bonus-malus level
is at 100%, and our information does not allow us to clearly distinguish between new
policies and policy renewals for drivers that have posterior information reflected by
a bonus-malus level of 100%. Since young drivers are more likely new customers we
expect interactions between the driver's age variable and the bonus-malus level; this
intuition is supported by Fig. 13.12 (lhs). In order to improve our model, we would
require more detailed information about past claims history. Note that we do
not strictly distinguish between prior and posterior information here. If we go over
to a time-series consideration, where more and more claims experience becomes
available for an individual driver, we should clearly distinguish the different sets of
information, because otherwise it may happen that in prior and posterior pricing
factors we correct twice for the same factor; an interesting paper is Corradin et
al. [82].

We also mention that a new source of posterior information is emerging through 
the collection of telematics car driving data. Telematics car driving data leads to a
completely new way of posterior information rate making (experience rating); we
refer to Ayuso et al. [17–19], Boucher et al. [42], Lemaire et al. [246] and Denuit
et al. [98]. We mention the papers of Gao et al. [152, 154] and Meng et al. [271]
who directly extract posterior feature information from telematics car driving data 
in order to improve rate making. This approach combines a Poisson GLM with a 
network extractor for the telematics car driving data. 


5.3 Model Validation 


One of the purposes of Chap. 4 has been to describe measures to analyze how well
a fitted model generalizes to unseen data. In a proper generalization analysis this
requires learning data $\mathcal{L}$ for in-sample model fitting and a test sample $\mathcal{T}$ for an
out-of-sample generalization analysis. In many cases, one is not in the comfortable
situation of having a test sample. In such situations one can use AIC that tries to
correct the in-sample figure for model complexity or, alternatively, K-fold
cross-validation as used in Table 5.3.

The purpose of this section is to introduce diagnostic tools for fitted models; these
are often based on unit deviances $\mathfrak{d}(Y_i, \mu_i)$, which play the role of squared residuals
in classical linear regression. Moreover, we discuss parameter and model selection,
for instance, by step-wise backward elimination or forward selection using the
analysis of variance (ANOVA) or the likelihood ratio test (LRT).


5.3.1 Residuals and Dispersion 


Within the EDF we distinguish two different types of residuals. The first type of
residuals are based on the unit deviances $\mathfrak{d}(Y_i, \mu_i)$ studied in (4.7). The deviance
residuals are given by

$$
r_i^{\rm D} = {\rm sign}(Y_i - \mu_i) \sqrt{\frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i)}.
$$

Secondly, Pearson's residuals are given by, see also (4.12),

$$
r_i^{\rm P} = \frac{Y_i - \mu_i}{\sqrt{V(\mu_i)}} \sqrt{\frac{v_i}{\varphi}}.
$$

In the Gaussian case the two residuals coincide. This indicates that Pearson's
residuals are most appropriate in the Gaussian case because they respect the
distributional properties in that case. For other distributions, Pearson's residuals
can be markedly skewed, as stated in Section 2.4.2 of McCullagh–Nelder [265],
and therefore may fail to have properties similar to Gaussian residuals. Another
issue occurs in Pearson's residuals when the denominator involves an estimated
standard deviation $\sqrt{V(\widehat{\mu}_i)}$, for instance, if we work in a small frequency Poisson
problem. Estimation uncertainty in small denominators of Pearson's residuals may
substantially distort the estimated residuals. For this reason, we typically work with
(the more robust) deviance residuals; this is related to the discussion in Chap. 4 on
MSEPs versus expected deviance GLs, see Remarks 4.6.

The squared residuals provide unit deviance and weighted square loss, respectively,

$$
(r_i^{\rm D})^2 = \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i) \qquad \text{and} \qquad (r_i^{\rm P})^2 = \frac{v_i}{\varphi}\, \frac{(Y_i - \mu_i)^2}{V(\mu_i)},
$$

the latter corresponds to Pearson's $\chi^2$-statistic, see (4.12).
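
Both residual types are easily computed by hand and compared with the values returned by R. The following sketch uses a simulated Poisson GLM on count scale, i.e., with $v_i = \varphi = 1$ and variance function $V(\mu) = \mu$; the data and parameter values are assumptions for this illustration.

```r
set.seed(5)
n <- 1000
x <- runif(n)
N <- rpois(n, exp(-1 + x))
fit <- glm(N ~ x, family = poisson())
mu  <- fitted(fit)

# Poisson unit deviances; the logarithmic term is set to 0 for N = 0
unit_dev <- 2 * (mu - N - ifelse(N > 0, N * log(mu / N), 0))

r_dev  <- sign(N - mu) * sqrt(pmax(unit_dev, 0))   # deviance residuals
r_pear <- (N - mu) / sqrt(mu)                      # Pearson's residuals, V(mu) = mu

# Comparison with the built-in residuals
print(max(abs(r_dev  - residuals(fit, type = "deviance"))))
print(max(abs(r_pear - residuals(fit, type = "pearson"))))
```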
Example 5.16 (Residuals in the Poisson Case) In the Poisson case, Pearson's
$\chi^2$-statistic is for $v_i = \varphi = 1$ given by

$$
(r_i^{\rm P})^2 = \frac{(Y_i - \mu_i)^2}{\mu_i},
$$

because we have variance function $V(\mu) = \mu$. A second order Taylor expansion
around $Y_i$ on the scale $\mu_i^{1/3}$ (for $\mu_i$) provides an approximation to the unit deviances in
the Poisson case, see formula (6.4) and Figure 6.2 in McCullagh–Nelder [265],

$$
\mathfrak{d}(Y_i, \mu_i) ~\approx~ 9\, Y_i^{1/3} \left( Y_i^{1/3} - \mu_i^{1/3} \right)^2. \tag{5.29}
$$

This emphasizes the different behaviors around the observation $Y_i$ of the two types
of residuals in the Poisson case. The scale $\mu_i^{1/3}$ has been motivated in
McCullagh–Nelder [265] by providing a symmetric behavior around the mode in $Y_i = 1$ of the
resulting log-likelihood function, see Fig. 5.5 (lhs). ■

Fig. 5.5 Log-likelihoods $\ell_Y(\mu)$ in $Y = 1$ as a function of $\mu$ plotted against (lhs) $\mu^{1/3}$ in the
Poisson case, (middle) $\mu^{-1/3}$ in the gamma case with shape parameter $\alpha = 1$, and (rhs) $\mu^{-1}$ in the
inverse Gaussian case with $\alpha = 1$
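
The approximation (5.29) can be verified by a short calculation. The following derivation is a sketch that writes $y > 0$ for the observation $Y_i$ and introduces the auxiliary variable $u = \mu^{1/3}$ and the function $f$ (both are notation of this sketch only).

```latex
\begin{align*}
\mathfrak{d}(y,\mu) &= 2\left(\mu - y - y\log(\mu/y)\right), \qquad u = \mu^{1/3},\\
f(u) &:= \mathfrak{d}(y,u^3) = 2\left(u^3 - y - 3y\log u + y\log y\right),\\
f'(u) &= 6\left(u^2 - y/u\right), \qquad f''(u) = 6\left(2u + y/u^2\right),\\
f(y^{1/3}) &= 0, \qquad f'(y^{1/3}) = 0, \qquad f''(y^{1/3}) = 18\,y^{1/3},\\
\mathfrak{d}(y,\mu) &\approx \tfrac{1}{2}\, f''(y^{1/3}) \left(u - y^{1/3}\right)^2
  = 9\,y^{1/3}\left(y^{1/3} - \mu^{1/3}\right)^2 .
\end{align*}
```

Since both $f$ and $f'$ vanish at $u = y^{1/3}$, the second order term is the leading term of the expansion.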


The explicit calculation of the residuals requires knowledge of the dispersion
parameter $\varphi > 0$. In the Poisson Example 5.16 this dispersion parameter has been
set equal to 1 because the Poisson model allows neither for under- nor for
over-dispersion. Typically, this is not the case for other models, and this requires
determination of the dispersion parameter if we want to simulate from these other
models. So far, this dispersion parameter has been treated as a nuisance parameter
and, in fact, it canceled in MLE (because it was assumed to be constant), see
Proposition 5.1.

If we need to estimate the dispersion parameter, we can either do this within
MLE, see Remarks 5.2, or we can use Pearson's or the deviance estimates,
respectively,

$$
\widehat{\varphi}^{\rm P} = \frac{1}{n - (q+1)} \sum_{i=1}^n \frac{(Y_i - \widehat{\mu}_i)^2}{V(\widehat{\mu}_i)/v_i}
\qquad \text{and} \qquad
\widehat{\varphi}^{\rm D} = \frac{1}{n - (q+1)} \sum_{i=1}^n v_i\, \mathfrak{d}(Y_i, \widehat{\mu}_i), \tag{5.30}
$$

where $\widehat{\mu}_i = \widehat{\mu}(x_i)$ are the MLE estimated means involving $q + 1$ estimated
parameters $\widehat{\beta}^{\rm MLE} \in \mathbb{R}^{q+1}$. We briefly motivate these choices. Firstly, Pearson's
estimate $\widehat{\varphi}^{\rm P}$ is consistent for $\varphi$. Note that in the Gaussian case this is just the standard
estimate for the variance parameter. Justification of the deviance dispersion estimate
is more challenging. Consider the unscaled deviance with $\widehat{\mu}_n = (\widehat{\mu}_1, \ldots, \widehat{\mu}_n)^\top$,
see (4.9),

$$
n \varphi\, \mathfrak{D}(Y_n, \widehat{\mu}_n) = \sum_{i=1}^n v_i\, \mathfrak{d}(Y_i, \widehat{\mu}_i).
$$

This statistic is under certain assumptions asymptotically $\varphi \chi^2_{n-(q+1)}$-distributed,
where $\chi^2_{n-(q+1)}$ denotes a $\chi^2$-distribution with $n - (q + 1)$ degrees of freedom. Thus,
this approximation gives us an expected value of $\varphi (n - (q + 1))$. This exactly justifies
the deviance dispersion estimate (5.30) in these cases. However, as stated in the last
paragraph of Section 2.3 of McCullagh–Nelder [265], often a $\chi^2$-approximation is
not suitable even as $n \to \infty$. We give an example.

Fig. 5.6 Expected unit deviance $v \mathbb{E}_\mu[\mathfrak{d}(Y, \mu)]$ in the Poisson case as a function of $\mathbb{E}[N] =
\mathbb{E}[vY] = v\mu$; the two plots only differ in the scale on the x-axis


Example 5.17 (Poisson Unit Deviances) The deviance statistic in the Poisson
model with means $\mu_n = (\mu_1, \ldots, \mu_n)^\top$ is given by

$$
\mathfrak{D}(Y_n, \mu_n) = \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}(Y_i, \mu_i) = \frac{1}{n} \sum_{i=1}^n 2 v_i \left( \mu_i - Y_i - Y_i \log\left( \frac{\mu_i}{Y_i} \right) \right),
$$

note that in the Poisson model we have (by definition) $\varphi = 1$. We evaluate the
expected value of this deviance statistic. It is given by

$$
\mathbb{E}_{\mu_n}\left[ \mathfrak{D}(Y_n, \mu_n) \right] = \frac{1}{n} \sum_{i=1}^n 2 v_i\, \mathbb{E}_{\mu_i}\left[ \mu_i - Y_i - Y_i \log\left( \frac{\mu_i}{Y_i} \right) \right] = \frac{1}{n} \sum_{i=1}^n 2\, \mathbb{E}_{\mu_i}\left[ N_i \log\left( \frac{N_i}{v_i \mu_i} \right) \right],
$$

with $N_i \sim {\rm Poi}(v_i \mu_i)$.

In Fig. 5.6 we plot the expected unit deviance $v\mu \mapsto v \mathbb{E}_\mu[\mathfrak{d}(Y, \mu)]$ in the Poisson
model. In our example of Table 5.3, we have $\mathbb{E}_\mu[vY] = v\mu \approx 3.89\%$, which results
in an expected unit deviance of $v \mathbb{E}_\mu[\mathfrak{d}(Y, \mu)] \approx 25.52 \cdot 10^{-2} < 1$. This is in line with
the losses in Table 5.3. Thus, the expected deviance $n \mathbb{E}_{\mu_n}[\mathfrak{D}(Y_n, \mu_n)] \approx n/4 < n$
is substantially smaller than $n$. But this implies that $n \mathfrak{D}(Y_n, \widehat{\mu}_n)$ cannot
be asymptotically $\chi^2_{n-(q+1)}$-distributed because the latter has an expected value of
$n - (q+1) \approx n$ for $n \to \infty$. In fact, the deviance dispersion estimate is not consistent
in this example, and for a consistent estimate one should rely on Pearson's dispersion
estimate.

In order to have an asymptotic $\chi^2$-distribution we need to have large volumes
$v$ because then a saddlepoint approximation holds that allows us to approximate the
(scaled) unit deviances by $\chi^2$-distributions, see Sect. 5.5.2, below. ■
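
The numbers of this example can be reproduced approximately in R. The sketch below first evaluates the expected unit deviance $2\,\mathbb{E}[N \log(N/\lambda)]$ for $N \sim {\rm Poi}(\lambda)$ with $\lambda = v\mu = 3.89\%$ by truncating the series (the helper expected_unit_deviance and the truncation point nmax are our choices), and then compares Pearson's and the deviance dispersion estimates (5.30) on a simulated low-frequency Poisson portfolio with unit exposures. The simulated portfolio is an assumption for this illustration and not the MTPL data.

```r
# Expected unit deviance 2 * E[N log(N / lambda)] for N ~ Poi(lambda), lambda = v * mu
expected_unit_deviance <- function(lambda, nmax = 100) {
  k <- 1:nmax                                   # the term for N = 0 vanishes
  2 * sum(dpois(k, lambda) * k * log(k / lambda))
}
eud <- expected_unit_deviance(0.0389)
print(eud)                                      # roughly one quarter, far below 1

set.seed(6)
n <- 50000
x <- runif(n)
N <- rpois(n, exp(log(0.0389) + 0.3 * (x - 0.5)))   # true dispersion equals 1
fit <- glm(N ~ x, family = poisson())
mu  <- fitted(fit)
df  <- n - length(coef(fit))                    # n - (q + 1)

phi_pearson  <- sum((N - mu)^2 / mu) / df       # Pearson's dispersion estimate
phi_deviance <- deviance(fit) / df              # deviance dispersion estimate
print(c(pearson = phi_pearson, deviance = phi_deviance))
```

In such a low-frequency situation Pearson's estimate stays close to the true value 1, whereas the deviance estimate ends up near the expected unit deviance.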


5.3.2 Hypothesis Testing 


Consider a sub-vector $\beta_r \in \mathbb{R}^r$ of the GLM parameter $\beta \in \mathbb{R}^{q+1}$, for $r < q + 1$.
We would like to understand if we can set this sub-vector $\beta_r = 0$, and at the same
time we do not lose any generalization power. Thus, we investigate whether there is
a simpler nested GLM that provides a similar prediction accuracy. If this is the case,
preference should be given to the simpler model because the bigger model seems
over-parametrized (has redundancy, is not parsimonious). This section is based on
Section 2.2.2 of Fahrmeir–Tutz [123].


Geometric Interpretation We begin by giving a geometric interpretation. We start
from the full model being expressed by the design matrix $\mathfrak{X} \in \mathbb{R}^{n \times (q+1)}$. This design
matrix together with the link function $g$ generates a $(q + 1)$-dimensional manifold
$\mathfrak{M} \subset \mathbb{R}^n$ given by, see (5.19) and Fig. 5.2,

$$
\mathfrak{M} = \left\{ \mu = g^{-1}(\mathfrak{X} \beta) = \left( g^{-1}\langle \beta, x_1 \rangle, \ldots, g^{-1}\langle \beta, x_n \rangle \right)^\top \in \mathbb{R}^n \,\middle|\, \beta \in \mathbb{R}^{q+1} \right\} \subset \mathbb{R}^n.
$$

The MLE $\widehat{\beta}^{\rm MLE}$ is determined by the point in $\mathfrak{M}$ that minimizes the distance to $Y$,
where the distance between $Y$ and $\mathfrak{M}$ is measured component-wise by $\frac{v_i}{\varphi} \mathfrak{d}(Y_i, \mu_i)$ with
$\mu \in \mathfrak{M}$, i.e., w.r.t. the KL divergence.

Assume, now, that we want to drop the components $\beta_r$ in $\beta$, i.e., we want to drop
these columns from the design matrix resulting in a smaller design matrix $\mathfrak{X}_r \in
\mathbb{R}^{n \times (q+1-r)}$. This generates a $(q + 1 - r)$-dimensional nested manifold $\mathfrak{M}_r \subset \mathfrak{M}$
described by

$$
\mathfrak{M}_r = \left\{ \mu = g^{-1}(\mathfrak{X}_r \beta) \in \mathbb{R}^n \,\middle|\, \beta \in \mathbb{R}^{q+1-r} \right\} \subset \mathfrak{M}.
$$

If the distance of $Y$ to $\mathfrak{M}_r$ and $\mathfrak{M}$ is roughly the same, we should go for
the smaller model. In the Gaussian case of Example 5.9 this can be explained
by the Pythagorean theorem applied to successive orthogonal projections. In the
general unit deviance case, this has to be studied in terms of information geometry
considering the KL divergence, see Sect. 2.3.

Likelihood Ratio Test (LRT) We consider the testing problem of the null hypothesis
$H_0$ against the alternative hypothesis $H_1$

$$
H_0: \beta_r = 0 \qquad \text{against} \qquad H_1: \beta_r \neq 0. \tag{5.31}
$$

Denote by $\widehat{\beta}^{\rm MLE}_{\rm full}$ the MLE under the full model and by $\widehat{\beta}^{\rm MLE}_{H_0}$ the MLE under the
null hypothesis model. Define the (log-)likelihood ratio test (LRT) statistic

$$
\Lambda = -2 \left( \ell_Y(\widehat{\beta}^{\rm MLE}_{H_0}) - \ell_Y(\widehat{\beta}^{\rm MLE}_{\rm full}) \right) \ge 0.
$$

The inequality holds because the null hypothesis model is nested in the full model,
hence, the latter needs to have a bigger log-likelihood value in the MLE. If
the LRT statistic $\Lambda$ is large, the null hypothesis should be rejected because the
reduced model is not competitive compared to the full model. More mathematically,
under similar conditions as for the asymptotic normality results of the MLE of
$\beta$ in (5.17), we have that under the null hypothesis $H_0$ the LRT statistic $\Lambda$ is
asymptotically $\chi^2$-distributed with $r$ degrees of freedom. Therefore, we should
reject the null hypothesis in favor of the full model if the resulting p-value of $\Lambda$
under the $\chi^2$-distribution is too small. These results remain true if the unknown
dispersion parameter $\varphi$ is replaced by a consistent estimator $\widehat{\varphi}$, e.g., Pearson's
dispersion estimate $\widehat{\varphi}^{\rm P}$ (from the bigger model).

The LRT statistic $\Lambda$ may not be properly defined in over-dispersed situations
where the distributional assumptions are not fully specified, for instance, in an
over-dispersed Poisson model. In such situations, one usually divides the log-likelihood
(of the Poisson model) by the estimated over-dispersion and then uses the resulting
scaled LRT statistic $\Lambda$ as an approximation to the unspecified model.
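
The LRT can be carried out in R by fitting the two nested models and comparing their log-likelihoods. The sketch below uses simulated Poisson data in which the feature x2 has no effect by construction (data and parameter values are assumptions for this illustration); the same test is available through the built-in function anova.

```r
set.seed(7)
n  <- 4000
x1 <- runif(n)
x2 <- runif(n)
N  <- rpois(n, exp(-1 + 0.7 * x1))              # x2 has no effect by construction

fit_full <- glm(N ~ x1 + x2, family = poisson())
fit_null <- glm(N ~ x1,      family = poisson())    # H0: coefficient of x2 equals 0

# LRT statistic and p-value from the chi-squared distribution with r degrees of freedom
Lambda  <- as.numeric(-2 * (logLik(fit_null) - logLik(fit_full)))
r       <- length(coef(fit_full)) - length(coef(fit_null))
p_value <- pchisq(Lambda, df = r, lower.tail = FALSE)
print(c(Lambda = Lambda, df = r, p_value = p_value))

# The same test via the built-in analysis of deviance
print(anova(fit_null, fit_full, test = "LRT"))
```

In the Poisson case with $\varphi = 1$ the LRT statistic coincides with the difference of the two residual deviances.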


Wald Test Alternatively, we can use the Wald statistic. The Wald statistic uses
a second order approximation to the log-likelihood and, therefore, is only based
on the first two moments (and not on the entire distribution). Define the matrix
$I_r \in \mathbb{R}^{r \times (q+1)}$ such that $\beta_r = I_r \beta$, i.e., matrix $I_r$ selects exactly the components of
$\beta$ that are included in $\beta_r$ (and which are set to 0 under the null hypothesis $H_0$ given
in (5.31)).

Asymptotic normality (5.17) motivates consideration of the Wald statistic

$$
W = \left( I_r \widehat{\beta}^{\rm MLE} - 0 \right)^\top \left( I_r\, \mathcal{I}_n(\widehat{\beta}^{\rm MLE})^{-1} I_r^\top \right)^{-1} \left( I_r \widehat{\beta}^{\rm MLE} - 0 \right). \tag{5.32}
$$

The Wald statistic measures the distance between the MLE in the full model
$I_r \widehat{\beta}^{\rm MLE}$ restricted to the components of $\beta_r$ and the null hypothesis $H_0$ (being
$\beta_r = 0$). The estimated Fisher's information matrix $\mathcal{I}_n(\widehat{\beta}^{\rm MLE})$ is used to bring
all components onto the same unit scale (and to account for collinearity). The
Wald statistic $W$ is asymptotically $\chi^2$-distributed with $r$ degrees of freedom under
the same assumptions as for (5.17) to hold. Thus, the null hypothesis $H_0$ should be
rejected if the resulting p-value of $W$ under the $\chi^2$-distribution is too small. Note
that this test does not require
calculation of the MLE in the null hypothesis model, i.e., this test is computationally
more attractive than the LRT because we only need to fit one model. Again, an
unknown dispersion parameter $\varphi$ in Fisher's information matrix $\mathcal{I}_n(\beta)$ is replaced by
a consistent estimator $\widehat{\varphi}$ (from the bigger model).

In the special case of considering only one component of $\beta$, i.e., if $\beta_r = \beta_k$ with
$r = 1$ and for one selected component $0 \le k \le q$, the Wald statistic reduces to

$$
W_k = \frac{(\widehat{\beta}_k^{\rm MLE})^2}{\widehat{\sigma}_k^2} = T_k^2 \qquad \text{with} \qquad T_k = \frac{\widehat{\beta}_k^{\rm MLE}}{\widehat{\sigma}_k}, \tag{5.33}
$$

with diagonal entries of the inverse of the estimated Fisher's information matrix
given by $\widehat{\sigma}_k^2 = \big( \mathcal{I}_n(\widehat{\beta}^{\rm MLE})^{-1} \big)_{k,k}$, $0 \le k \le q$. The square-roots of these estimates are
provided in column Std. Error of the R output in Listing 5.3.

In this case the Wald statistic $W_k$ is equal to the square of the t-statistic $T_k$;
this t-statistic is provided in column z value of the R output of Listing 5.3.
Note that Fisher's information matrix involves the dispersion parameter $\varphi$. If
this dispersion parameter is estimated with a consistent estimator $\widehat{\varphi}$ we have a
t-statistic. For known dispersion parameter the t-statistic reduces to a z-statistic,
i.e., the corresponding p-values can be calculated from a normal distribution instead
of a t-distribution. In the Poisson case, the dispersion $\varphi = 1$ is known, and for this
reason, we perform a z-test (and not a t-test) in the last column of Listing 5.3; and
we call $T_k$ a z-statistic in that case.
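
The Wald statistics can be computed from the output of a fitted glm object. The following sketch uses simulated Poisson data (an assumption for this illustration): it reproduces the columns z value and Pr(>|z|) of the R summary from (5.33), and it evaluates the joint Wald statistic (5.32) for the null hypothesis that both slope parameters are equal to zero, with the selection matrix Ir playing the role of $I_r$.

```r
set.seed(8)
n  <- 4000
x1 <- runif(n)
x2 <- runif(n)
N  <- rpois(n, exp(-1 + 0.7 * x1 + 0.2 * x2))
fit <- glm(N ~ x1 + x2, family = poisson())

beta_hat <- coef(fit)
cov_hat  <- vcov(fit)                     # inverse of the estimated Fisher information
sigma    <- sqrt(diag(cov_hat))           # column "Std. Error"
z        <- beta_hat / sigma              # T_k in (5.33), column "z value"
p_z      <- 2 * pnorm(abs(z), lower.tail = FALSE)
print(cbind(estimate = beta_hat, std_error = sigma, z = z, p = p_z))

# Joint Wald statistic (5.32) for H0: coefficients of x1 and x2 are both 0
Ir  <- rbind(c(0, 1, 0),
             c(0, 0, 1))                  # selects the two slope parameters
b   <- Ir %*% beta_hat
W   <- as.numeric(t(b) %*% solve(Ir %*% cov_hat %*% t(Ir)) %*% b)
p_W <- pchisq(W, df = nrow(Ir), lower.tail = FALSE)
print(c(W = W, p_value = p_W))
```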


5.3.3 Analysis of Variance 


In the previous section, we have presented tests that allow for model selection
in the case of nested models. More generally, if we have a full model, say,
based on regression parameter $\boldsymbol{\beta} \in \mathbb{R}^{q+1}$, we
would like to select the “best” sub-model according to some selection
criterion. In most cases, it is computationally not feasible to fit all
sub-models if $q$ is large, therefore, this is not a practical solution. For
large models and data sets step-wise procedures are a feasible tool. Backward
elimination starts from the full model, and then recursively drops feature
components which have high p-values in the corresponding Wald statistics (5.32)
and (5.33). Performing this recursively will provide us with a hierarchy of
nested models. Forward selection works just in the opposite direction, that is,
we start with the null model and we include feature components one after the
other that have a low p-value in the corresponding Wald statistics.
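The backward elimination loop can be sketched as follows. This is a minimal Python sketch under stated assumptions: `p_value_fn` stands for a user-supplied routine that refits the model on the current terms and returns one p-value per term; here it is replaced by a toy function with made-up p-values, chosen so that one term becomes significant only after a collinear term has been dropped.

```python
def backward_elimination(terms, p_value_fn, threshold=0.05):
    """Drop the term with the largest p-value until all p-values are <= threshold.

    p_value_fn(current_terms) must return a dict {term: p-value} for a model
    refitted on current_terms (hypothetical interface).
    """
    current = list(terms)
    path = [tuple(current)]            # hierarchy of nested models
    while current:
        p_values = p_value_fn(current)
        worst = max(current, key=lambda term: p_values[term])
        if p_values[worst] <= threshold:
            break
        current.remove(worst)
        path.append(tuple(current))
    return current, path

def toy_p_values(current):
    """Toy stand-in for a model fit; all numbers are invented for illustration."""
    base = {"BonusMalus": 1e-16, "DrivAge": 1e-10, "Density": 0.30, "Area": 0.60}
    p = {term: base[term] for term in current}
    # Mimic collinearity: Density becomes significant once Area has been dropped.
    if "Area" not in current and "Density" in current:
        p["Density"] = 1e-8
    return p

selected, path = backward_elimination(
    ["BonusMalus", "DrivAge", "Density", "Area"], toy_p_values)
print(selected)
print(path)
```

Forward selection follows the same pattern with the roles reversed: start from the empty list and add the candidate with the smallest p-value.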




Remarks 5.18 


• The order of the inclusion/exclusion of the feature components matters in
  these selection algorithms because we do not have additivity in this selection
  process. For this reason, backward elimination and forward selection are often
  combined in an alternating way.

• This process as well as the tests from Sect. 5.3.2 are based on a fixed
  pre-processing of features. If the feature pre-processing is done differently,
  all analysis needs to be repeated for this new model. Moreover, between two
  different models we need to apply different tools for model selection (if they
  are not nested), for instance, AIC, cross-validation or an out-of-sample
  generalization analysis.

• For categorical variables with dummy coding we should apply the forward
  selection or the backward elimination simultaneously on the entire dummy coded
  vector of a categorical variable. This will include or exclude this variable;
  if we only apply the Wald test to one component of the dummy vector, then we
  test whether this level should be merged with the reference level.


Typically, in practice, a so-called analysis of variance (ANOVA) table is studied. 
The ANOVA table is mainly motivated by the Gaussian model with orthogonal 
data. The Gaussian assumption implies that the deviance loss is equal to the 
square loss and the orthogonality implies that the square loss decouples in an 
additive way w.r.t. the feature components. This implies that one can explicitly 
study the contribution of each feature component to the decrease in square loss; 
an example is given in Section 2.3.2 of McCullagh—Nelder [265]. In non-Gaussian 
and non-orthogonal situations one loses this additivity property and, as mentioned 
in Remarks 5.18, the order of inclusion matters. Therefore, for the ANOVA table 
we pre-specify the order in which the components are included and then we analyze 
the decrease of deviance loss by the inclusion of additional components. 


Example 5.19 (Poisson GLM1, Revisited) We revisit the MTPL claim frequency 
example of Sect. 5.2.4 to illustrate the variable selection procedures. Based on the 
model presented in Listing 5.3 we run an ANOVA analysis using the R command
anova; the results are presented in Listing 5.4.

Listing 5.4 shows the hierarchy of models starting from the null model by 
sequentially including feature components one by one. The column Df gives the 
number of regression parameters involved and the column Deviance the decrease 
of deviance loss by the inclusion of this feature component. The biggest model
improvements are provided by the bonus-malus level and driver’s age; this is not
surprising in view of the empirical analysis in Figs. 5.3 and 5.4, and in Chap. 13.1.
At the other end we have the Area code which only seems to improve the model 
marginally. However, this does not imply, yet, that this variable should be dropped. 
There are two points that need to be considered: (1) maybe feature pre-processing 
of Area has not been done in an appropriate way and the variable is not in the 
right functional form for the chosen link function; and (2) Area is the last variable 
included in the model in Listing 5.4 and, maybe, there are already other variables
that take over the role of Area in smaller models, which is possible if we have
correlations between the feature components. In our data, Area and Density are
highly correlated. For this reason, we exchange the order of these two components
and run the same analysis again; we call this model Poisson GLM1B (which of
course provides the same predictive model as Poisson GLM1).


Listing 5.4 ANOVA table of model Poisson GLM1

    Analysis of Deviance Table
    Model: poisson, link: log
    Response: ClaimNb

    Terms added sequentially (first to last)

                   Df   Deviance  Resid. Df  Resid. Dev
    NULL                             610205       53852
    VehPowerGLM     5       73.7     610200       53779
    VehAgeGLM       2      179.7     610198       53599
    DrivAgeGLM      6     1199.4     610192       52400
    BonusMalusGLM   1     4300.6     610191       48099
    VehBrand       10      240.3     610181       47859
    VehGas          1       82.4     610180       47776
    DensityGLM      1 [illegible]    610179       47264
    Region         21      191.3     610158       47073
    AreaGLM         1        4.1     610157       47069


Listing 5.5 ANOVA table of model Poisson GLM1B

    Analysis of Deviance Table
    Model: poisson, link: log
    Response: ClaimNb

    Terms added sequentially (first to last)

                   Df   Deviance  Resid. Df  Resid. Dev
    NULL                             610205       53852
    VehPowerGLM     5       73.7     610200       53779
    VehAgeGLM       2      179.7     610198       53599
    DrivAgeGLM      6     1199.4     610192       52400
    BonusMalusGLM   1     4300.6     610191       48099
    VehBrand       10      240.3     610181       47859
    VehGas          1       82.4     610180       47776
    AreaGLM         1      505.0     610179       47271
    Region         21      192.4     610158       47079
    DensityGLM      1       10.1     610157       47069


Listing 5.5 shows the ANOVA table if we exchange the order of these two
variables. We observe that the magnitudes of the decrease of the deviance loss
have switched between the two variables. Overall, Density seems slightly more
predictive, and we may consider dropping Area from the model, also because the
correlation between Density and Area is very high.

If we want to perform backward elimination (sequentially drop one variable after 
the other) we can use the R command drop1. For small models this is doable, for 
larger models it is computationally demanding. 


Listing 5.6 drop1 analysis of model Poisson GLM1

    Single term deletions

    Model:
    ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAgeGLM + BonusMalusGLM +
        VehBrand + VehGas + DensityGLM + Region + AreaGLM

                   Df  Deviance     AIC     LRT   Pr(>Chi)
    <none>                47069  192818
    VehPowerGLM     5     47152  192892    83.4  < 2.2e-16 ***
    VehAgeGLM       2     47283  193028   214.1  < 2.2e-16 ***
    DrivAgeGLM      6     47603  193341   534.5  < 2.2e-16 ***
    BonusMalusGLM   1     50970  196718  3901.5  < 2.2e-16 ***
    VehBrand       10     47298  193027   228.9  < 2.2e-16 ***
    VehGas          1     47213  192961   144.5  < 2.2e-16 ***
    DensityGLM      1     47079  192826    10.1   0.001459 **
    Region         21     47259  192967   190.7  < 2.2e-16 ***
    AreaGLM         1     47073  192820     4.1   0.042180 *
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In Listing 5.6 we present the results of this drop1 analysis. According to both
AIC and the LRT, we should keep all variables in the model. Again, Area and
Density provide the smallest LRT statistics $\Lambda$, which illustrates the
high collinearity between these two variables (note that the values in
Listing 5.6 are identical to the ones in Listings 5.4 and 5.5, respectively).
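The columns of Listing 5.6 are linked by simple arithmetic, which the following Python sketch checks for three of its rows. It relies on two standard facts rather than on anything specific to the R routine: with dispersion 1, dropping a term with Df parameters changes AIC by LRT minus 2·Df, and for Df = 1 the $\chi^2$ tail probability can be written with the complementary error function. The helper name `chi2_sf_1df` is ours. Since the LRT values in the listing are rounded, the results only agree up to rounding.

```python
import math

# (Df, LRT, AIC after dropping the term), copied from Listing 5.6.
drop1 = {
    "VehPowerGLM": (5, 83.4, 192892),
    "DensityGLM": (1, 10.1, 192826),
    "AreaGLM": (1, 4.1, 192820),
}
aic_full = 192818

for term, (df, lrt, aic_reduced) in drop1.items():
    # The deviance increases by LRT, the AIC penalty decreases by 2 * Df.
    delta_aic = lrt - 2 * df
    print(f"{term}: LRT - 2*Df = {delta_aic:.1f}, AIC difference = {aic_reduced - aic_full}")

def chi2_sf_1df(x):
    """P[X > x] for a chi-square distributed X with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

p_density = chi2_sf_1df(10.1)
p_area = chi2_sf_1df(4.1)
print(f"p-values from rounded LRTs: Density {p_density:.4f}, Area {p_area:.4f}")
```

All AIC differences are positive, which is the AIC-based reason for keeping every term.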

We conclude that in model Poisson GLM1 we should keep all feature components,
and a model improvement can only be obtained by a different feature
pre-processing, by a different regression function or by a different
distributional model. ■


5.3.4 Lab: Poisson GLM for Car Insurance Frequencies, 
Revisited 


Continuous Coding of Non-monotone Feature Components 


We revisit model Poisson GLM1 studied in Sect. 5.2.4 for MTPL claim frequency
modeling, and we consider additional competing models by using different feature
pre-processing. From Example 5.19, above, we conclude that we should keep all
variables in the model if we work with model Poisson GLM1.


Table 5.4 Contingency table of observed number of policies against predicted
number of policies with given claim counts ClaimNb

                                   Numbers of claims ClaimNb
                                       0        1       2     3    4     5
    Observed number of policies   587'772   21'198   1'174    57    4     1
    Predicted number of policies  587'325   22'064     719    34    3   0.3


We calculate Pearson’s dispersion estimate which provides
$\widehat{\varphi}^{\rm P} = 1.6697 > 1$. This indicates that the model is not
fully suitable for our data because in a Poisson model the dispersion parameter
should be equal to 1. There may be two reasons for this over-dispersion: (1) the
Poisson assumption is not appropriate because, for instance, the observations
are more heavy-tailed, or (2) the Poisson assumption is appropriate but the
regression function has not been chosen in a fully suitable way (maybe also due
to missing feature information).
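For orientation, Pearson's dispersion estimate can be sketched in a few lines of Python. We assume the usual definition for a Poisson model with variance function $V(\mu) = \mu$: the sum of squared Pearson residuals of the claim counts, divided by the number of observations minus the number of estimated parameters. The data below are toy numbers, not the MTPL data.

```python
def pearson_dispersion(counts, expected, n_params):
    """Pearson's dispersion estimate for a Poisson model with V(mu) = mu.

    counts:   observed claim counts N_i
    expected: fitted expected claim counts v_i * mu_i
    n_params: number of estimated regression parameters (q + 1)
    """
    n = len(counts)
    chi2 = sum((N - m) ** 2 / m for N, m in zip(counts, expected))
    return chi2 / (n - n_params)

# Toy data: a null model whose fitted mean equals the empirical mean 6/10.
counts = [0, 0, 1, 0, 3, 0, 0, 2, 0, 0]
expected = [0.6] * len(counts)
phi_hat = pearson_dispersion(counts, expected, n_params=1)
print(f"Pearson dispersion estimate: {phi_hat:.4f}")
```

A value clearly above 1, as in this toy example, points to over-dispersion relative to the fitted Poisson model.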

We believe that in our example the observed over-dispersion is a mixture of 
the two reasons (1) and (2). Surely, the regression structure can be improved since 
our feature pre-processing is non-optimal and since the chosen regression function 
only considers multiplicative interactions between the feature components (we have 
chosen the log-link regression function without adding interaction terms to the 
regression function). 

Table 5.4 gives a contingency table. We observe that we have many more policies
with more than 1 claim compared to what is predicted by the fitted model. As a
result, a $\chi^2$-test rejects this Poisson model because the resulting p-value
is close to 0.

In our data, we have a rather large number of policies with short exposures
$v_i$, and further analysis suggests that these short exposures are not suitably
modeled. We will not invest more time into improving the exposure modeling. As
mentioned in the appendix, there seem to be a couple of issues with how the
exposures are displayed and how policy renewals are accounted for in this data.
However, it is difficult (almost impossible) to clean the data for better
exposure measures without more detailed information about the data collection
process.

Table 5.5 Run times, number of parameters, AICs, in-sample and out-of-sample
deviance losses, tenfold cross-validation losses (units are in $10^{-2}$) and
in-sample average frequency of the null model (intercept model) and of different
Poisson GLMs

                   Run    #                 In-sample             Out-of-sample         Tenfold CV                        Aver.
                   time   param.  AIC       loss on $\mathcal{L}$ loss on $\mathcal{T}$ loss $\widehat{D}^{\rm CV}$       freq.
    Poisson null   -      1       199'506   25.213                25.445                25.213                            7.36%
    Poisson GLM1   16s    49      192'818   24.101                24.146                24.121                            7.36%
    Poisson GLM2   15s    48      192'753   24.091                24.113                24.110                            7.36%
    Poisson GLM3   15s    50      192'716   24.084                24.102                24.104                            7.36%


Our next aim is to model continuous feature components differently, if their raw
form does not match the linear predictor assumption. In Poisson GLM1 we have
categorized such components and then used dummy coding for the resulting classes,
see Sect. 5.2.4. Alternatively, we can use different functional forms, for
instance, we can use for DrivAge the following pre-processing

$$
\mathtt{DrivAge} ~\mapsto~ \beta_l\, \mathtt{DrivAge} + \beta_{l+1} \log(\mathtt{DrivAge})
+ \sum_{j=2}^{4} \beta_{l+j}\, (\mathtt{DrivAge})^j.
\tag{5.34}
$$

This replaces the $K = 7$ categorical age classes of model Poisson GLM1 by
5 continuous functions of the variable DrivAge, and the number of regression
parameters is reduced from $K - 1 = 6$ to 5. We call this model Poisson GLM2.
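The pre-processing (5.34) amounts to replacing one raw feature by five derived columns. A minimal Python sketch of this feature map is given below; the function names are ours, and the parameter values in the usage line are placeholders chosen only to make the arithmetic visible.

```python
import math

def driv_age_features(age):
    """Map DrivAge to the five continuous terms of (5.34):
    DrivAge, log(DrivAge), DrivAge^2, DrivAge^3, DrivAge^4."""
    return [age, math.log(age), age ** 2, age ** 3, age ** 4]

def driv_age_contribution(age, betas):
    """Contribution of DrivAge to the linear predictor for five given parameters."""
    return sum(b * f for b, f in zip(betas, driv_age_features(age)))

features = driv_age_features(20.0)
print(features)

# Placeholder parameters: only the quadratic term is switched on.
print(driv_age_contribution(20.0, [0.0, 0.0, 1.0, 0.0, 0.0]))
```

The five columns are strongly correlated by construction, so the resulting parameter estimates should be inspected with some care.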

Besides improving the modeling of the feature components we can also start
to add interactions beyond the multiplicative ones. For instance, Fig. 13.12 in
Chap. 13 may indicate that there is an interaction term between BonusMalus
and DrivAge. New young drivers enter the bonus-malus system at level 100,
and it takes some years free of accidents to reach the lowest bonus-malus level
of 50. For senior drivers, in contrast, a bonus-malus level of 100 may indicate
that they have had a bad claim experience because otherwise they would be on the
lowest bonus-malus level, see also Remark 5.15. We add the following interaction
to Poisson GLM2 and we call the resulting model Poisson GLM3

$$
\beta_{l'}\, \mathtt{BonusMalus} \cdot \mathtt{DrivAge}
+ \beta_{l'+1}\, \mathtt{BonusMalus} \cdot (\mathtt{DrivAge})^2.
\tag{5.35}
$$

From Table 5.5 we observe that this leads to a further small model improvement.
We mention that this model improvement can also be observed in a decrease of
Pearson’s dispersion estimate to $\widehat{\varphi}^{\rm P} = 1.6644$. Notably,
all model selection criteria (AIC, out-of-sample generalization loss and
cross-validation) come to the same conclusion in this example.

The tedious task of the modeler now is to find all these systematic effects and 
bring them in an appropriate form into the model. Here, this is still possible because 
we have a comparably small model. However, if we have hundreds of feature 
components, such a manual analysis becomes intractable. Other regression models 
such as network regression models should be preferred, or at least should be used 
to find systematic effects. But, one should also keep in mind that the (final) chosen 
model should be as simple as possible (parsimonious). 


Listing 5.7 drop1 analysis of model Poisson GLM2

    Single term deletions

    Model:
    ClaimNb ~ VehPowerGLM + VehAgeGLM + DrivAge + log(DrivAge) +
        I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4) + BonusMalusGLM +
        VehBrand + VehGas + DensityGLM + Region + AreaGLM

                   Df  Deviance     AIC     LRT   Pr(>Chi)
    <none>                47005  192753
    VehPowerGLM     5     47087  192825    82.4  2.671e-16 ***
    VehAgeGLM       2     47225  192969   220.3  < 2.2e-16 ***
    DrivAge         1     47157  192902   151.9  < 2.2e-16 ***
    log(DrivAge)    1     47190  192935   184.8  < 2.2e-16 ***
    I(DrivAge^2)    1     47123  192869   118.1  < 2.2e-16 ***
    I(DrivAge^3)    1     47094  192840    89.0  < 2.2e-16 ***
    I(DrivAge^4)    1     47071  192816    65.5  5.687e-16 ***
    BonusMalusGLM   1     50907  196653  3902.0  < 2.2e-16 ***
    VehBrand       10     47232  192959   226.5  < 2.2e-16 ***
    VehGas          1     47148  192893   142.8  < 2.2e-16 ***
    DensityGLM      1     47015  192761    10.1   0.001498 **
    Region         21     47193  192899   188.0  < 2.2e-16 ***
    AreaGLM         1     47009  192755     4.1   0.043123 *
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Remarks 5.20


• An advantage of GLMs is that these regression models can deal with collinearity
  in feature components. Nevertheless, the results should be carefully checked if
  the collinearity in feature components is very high. If we have a high
  collinearity between two feature components then we may observe large values
  with opposite signs in the corresponding regression parameters compensating
  each other. The resulting GLM will not be very robust, and a slight change in
  the observations may change these regression parameters completely. In this
  case one should drop one of the two highly collinear feature components. This
  problem may also occur if we include too many terms in functional forms like
  in (5.34).

• A tool to find suitable functional forms of regression functions in continuous
  feature components are the partial residual plots of Cook–Croos-Dabrera [80]. If
  we want to analyze the first feature component $x_1$ of $\boldsymbol{x}$, we can
  fit a GLM to the data using the entire feature vector $\boldsymbol{x}$. The
  partial residuals for component $x_1$ are defined by, see formula (8) in
  Cook–Croos-Dabrera [80],

  $$
  r_i^{\rm partial} = \big(Y_i - \mu(\boldsymbol{x}_i)\big)\, g'\big(\mu(\boldsymbol{x}_i)\big)
  + \beta_1 x_{i,1} \qquad \text{for } 1 \le i \le n,
  $$

  where $g$ is the chosen link function and
  $g(\mu(\boldsymbol{x}_i)) = \langle \boldsymbol{\beta}, \boldsymbol{x}_i \rangle$.
  These partial residuals offset the effect of feature component $x_1$. The
  partial residual plot shows $r_i^{\rm partial}$ against $x_{i,1}$. If this plot
  shows a linear structure then including $x_1$ linearly is justified, and any
  other functional form may be detected from that plot.
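For the log-link we have $g'(\mu) = 1/\mu$, so the partial residuals of the last remark are easy to compute. The Python sketch below does this for a toy data set; the parameter vector is hypothetical, the function name is ours, and the observations are set equal to the fitted means so that the partial residuals lie exactly on the line $\beta_1 x_{i,1}$.

```python
import math

def partial_residuals_log_link(y, X, beta, component):
    """Partial residuals for one feature component under the log-link,
    where g(mu) = log(mu) and hence g'(mu) = 1/mu.

    X: rows of features including a leading 1 for the intercept
    beta: (fitted) regression parameters, same length as each row of X
    component: index of the feature component to be analyzed
    """
    residuals = []
    for y_i, x_i in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, x_i))   # linear predictor
        mu = math.exp(eta)                            # mean under the log-link
        residuals.append((y_i - mu) / mu + beta[component] * x_i[component])
    return residuals

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
beta = [-1.0, 0.5]                                    # hypothetical parameters
y = [math.exp(-1.0), math.exp(-0.5), math.exp(0.0)]   # equal to the fitted means
r = partial_residuals_log_link(y, X, beta, component=1)
print(r)
```

With real observations one would plot these residuals against $x_{i,1}$ and look for systematic deviations from a straight line.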


Under-Sampling and Over-Sampling 


Often run times are an issue in model fitting, in particular, if we want to
experiment with different models, different feature codings, etc. Under-sampling
is an interesting approach that can be applied in imbalanced situations (like in
our claim frequency data situation) to speed up calculations while still
obtaining accurate approximations. We briefly describe under-sampling in this
subsection.

Under-sampling is based on the idea that we do not need to consider all
$n = 610'206$ insurance policies for model fitting, and we can still obtain
accurate results. For this we select all insurance policies that have at least
1 claim; in our data these are 22'434 insurance policies, we call this data set
$\mathcal{L}^*_{N \ge 1}$. The motivation for selecting these insurance policies
is that these are exactly the policies that have information about the drivers
causing claims. These selected insurance policies need to be complemented with
policies that do not cause any claims. We select at random (under-sample) 22'434
insurance policies of drivers without claims, we call this data set
$\mathcal{L}^*_{N = 0}$. Merging the two sets we receive data
$\mathcal{L}^* = \mathcal{L}^*_{N = 0} \cup \mathcal{L}^*_{N \ge 1}$, comprising
44'868 insurance policies. This data is balanced from the viewpoint of claim
causing policies because exactly half of the policies in $\mathcal{L}^*$ suffer
a claim and the other half do not. The idea now is to fit a GLM only on this
learning data $\mathcal{L}^*$, and because we only consider 44'868 insurance
policies the fitting should be fast.

There is still one point to be considered, namely, in the new learning data
$\mathcal{L}^*$ policies with claims are over-represented (because we work in a
low frequency problem). This motivates that we adjust the time exposures $v_i$
of the under-sampled policies without claims, $\mathcal{L}^*_{N=0}$, accordingly
by multiplying as follows

$$
v_i ~\mapsto~ v_i \, \frac{\sum_{j=1}^{n} v_j \mathbb{1}_{\{N_j = 0\}}}{\sum_{j \in \mathcal{L}^*_{N=0}} v_j}.
$$

Thus, we stretch the exposures of the policies without claims in $\mathcal{L}^*$;
for our data this factor is 26.17. This then provides us with an empirical
frequency on $\mathcal{L}^*$ of 7.36%, which is identical to the observed
frequency on the entire learning data $\mathcal{L}$.
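The under-sampling step and the exposure adjustment can be sketched as follows. This is a minimal Python sketch on simulated toy data (uniform exposures and a 5% claim indicator), not on the MTPL data; the function names are ours. It shows that the stretched exposures reproduce the empirical frequency of the full portfolio.

```python
import random

def under_sample(policies, seed=0):
    """Balance a portfolio of (exposure, claim_count) pairs by under-sampling.

    Keeps all policies with claims, draws equally many policies without claims,
    and stretches the exposures of the drawn policies so that the total exposure
    of the claim-free policies is preserved.
    """
    rng = random.Random(seed)
    with_claims = [p for p in policies if p[1] >= 1]
    without_claims = [p for p in policies if p[1] == 0]
    sampled = rng.sample(without_claims, len(with_claims))
    factor = sum(v for v, _ in without_claims) / sum(v for v, _ in sampled)
    adjusted = [(v * factor, n) for v, n in sampled]
    return with_claims + adjusted, factor

def frequency(policies):
    """Empirical claim frequency: total claims divided by total exposure."""
    return sum(n for _, n in policies) / sum(v for v, _ in policies)

# Toy portfolio: exposures in (0.1, 1), roughly 5% of policies have one claim.
rng = random.Random(1)
portfolio = [(rng.uniform(0.1, 1.0), 1 if rng.random() < 0.05 else 0)
             for _ in range(10_000)]

balanced, factor = under_sample(portfolio)
print(f"policies: {len(portfolio)} -> {len(balanced)}, stretch factor {factor:.2f}")
print(f"frequency full: {frequency(portfolio):.4%}, balanced: {frequency(balanced):.4%}")
```

The two frequencies agree up to floating-point error because the claims and the total exposure are both unchanged by construction.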

We fit model Poisson GLM3 on this reduced (and exposure adjusted) learning
data $\mathcal{L}^*$; the results are presented on the last line of Table 5.6.
This model can be fitted in 1s, and by construction it fulfills the balance
property. The resulting in-sample and out-of-sample losses (evaluated on the
entire data $\mathcal{L}$ and $\mathcal{T}$) are very close to model Poisson
GLM3, which verifies that the model fitted only on the learning data
$\mathcal{L}^*$ gives a good approximation. We do not provide AIC because the
data used is not identical to the data used to fit the other models. The tenfold
cross-validation loss is a little bit bigger, which seems to be a consequence of
applying the non-stratified version to only 44'868 insurance policies, i.e.,
this higher cross-validation loss shows that we fit the model on less data,
which provides higher uncertainty in model fitting. This finishes this example.


Table 5.6 Run times, number of parameters, AICs, in-sample and out-of-sample
deviance losses, tenfold cross-validation losses (units are in $10^{-2}$) and
in-sample average frequency of the null model (intercept model) and of different
Poisson GLMs, the last row uses under-sampling in model Poisson GLM3

                     Run    #                 In-sample             Out-of-sample         Tenfold CV                        Aver.
                     time   param.  AIC       loss on $\mathcal{L}$ loss on $\mathcal{T}$ loss $\widehat{D}^{\rm CV}$       freq.
    Poisson null     -      1       199'506   25.213                25.445                25.213                            7.36%
    Poisson GLM1     16s    49      192'818   24.101                24.146                24.121                            7.36%
    Poisson GLM2     15s    48      192'753   24.091                24.113                24.110                            7.36%
    Poisson GLM3     15s    50      192'716   24.084                24.102                24.104                            7.36%
    under-sampling   1s     50      -         24.098                24.108                24.120                            7.36%

The presented method is called under-sampling because we under-sample from
the insurance policies without claims to make both classes (policies with claims
and policies without claims) equally large. Alternatively, to achieve a class
balance we could also over-sample from the minority class by duplicating
policies. This has a similar effect, but it increases run times. Importantly, if
we under- or over-sample we have to adjust the exposures correspondingly.
Otherwise we obtain a biased model that is not useful for pricing; the same
applies to methods such as the synthetic minority oversampling technique (SMOTE)
and similar techniques.

As an alternative to under-sampling, we could also fit a so-called
zero-truncated Poisson (ZTP) model to the data by only fitting a model on the
insurance policies that suffer at least one claim, and adjusting the
distribution to the observations $N_i|_{\{N_i \ge 1\}}$. This is rather similar
to a hurdle Poisson model and we come back to this in Example 6.19, below.


5.3.5 Over-Dispersion in Claim Counts Modeling 


Mixed Poisson Distribution 


In the previous example we have seen that the considered Poisson GLMs do not fully 
fit our data, at least not with the chosen feature engineering, because there is over- 
dispersion in the data (relative to the chosen models). This may give rise to consider 
models that allow for over-dispersion. Typically, such over-dispersed models are 
constructed starting from the Poisson model, because the Poisson model enjoys 
many nice properties as we have seen above. A natural extension is to introduce the 
family of mixed Poisson models, where the frequency is not modeled with a single 
parameter but rather with a whole family of parameters described by an underlying 
mixing distribution. 
In the dual mean parametrization the Poisson distribution for $Y = N/v$ reads as

$$
Y ~\sim~ f(y; \lambda, v) = e^{-v\lambda}\, \frac{(v\lambda)^{vy}}{(vy)!}
\qquad \text{for } y \in \mathbb{N}_0/v,
$$

where the mean parameter is given by $\lambda = \kappa'(\theta) = \exp\{\theta\}$,
and $\theta$ denotes the canonical parameter; on purpose we use for the mean the
notation $\lambda$ instead of $\mu$ here, the reason will become clear below.
This model satisfies for the first two moments of $N = vY$

$$
\mathbb{E}_\lambda[N] = v\kappa'(\theta) = v\lambda
\qquad \text{and} \qquad
{\rm Var}_\lambda(N) = v\kappa''(\theta) = v\lambda = \mathbb{E}_\lambda[N],
$$

with dispersion parameter $\varphi = 1$. A mixed Poisson distribution is
obtained by mixing/integrating over different frequency parameters
$\lambda > 0$. We choose a distribution $\pi$ on $\mathbb{R}_+$ (strictly
positively supported), and define the new distribution

$$
Y = N/v ~\sim~ f_\pi(y; v) = \int_{\mathbb{R}_+} f(y; \lambda, v)\, d\pi(\lambda)
= \int_{\mathbb{R}_+} e^{-v\lambda}\, \frac{(v\lambda)^{vy}}{(vy)!}\, d\pi(\lambda).
\tag{5.36}
$$


If $\pi$ is not concentrated in a single point, the tower property immediately
implies

$$
\mathbb{E}_\pi[N] < {\rm Var}_\pi(N),
\tag{5.37}
$$

provided that the moments exist; we refer to Lemma 2.18 in Wüthrich [387]. Hence,
mixing over different frequency parameters allows us to obtain over-dispersion.
Of course, this concept can also be applied to mixing over the canonical
parameter $\theta$ in the EF (instead of the mean parameter).
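The tower property argument behind the over-dispersion inequality can be written out in two lines. In the following sketch we write $\Lambda \sim \pi$ for the random frequency parameter (this symbol is introduced only for this derivation) and use that, conditionally on $\Lambda$, the claim count $N$ is Poisson distributed with mean $v\Lambda$.

```latex
\begin{align*}
\mathbb{E}_\pi[N] &= \mathbb{E}\big[\mathbb{E}[N \mid \Lambda]\big] = v\,\mathbb{E}[\Lambda], \\
{\rm Var}_\pi(N) &= \mathbb{E}\big[{\rm Var}(N \mid \Lambda)\big] + {\rm Var}\big(\mathbb{E}[N \mid \Lambda]\big)
 = v\,\mathbb{E}[\Lambda] + v^2\,{\rm Var}(\Lambda)
 ~>~ v\,\mathbb{E}[\Lambda] = \mathbb{E}_\pi[N].
\end{align*}
```

The strict inequality holds because ${\rm Var}(\Lambda) > 0$ whenever $\pi$ is not concentrated in a single point.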

This leads to the framework of Bayesian credibility models, which are widely
used and studied in actuarial science; we refer to the textbook of
Bühlmann–Gisler [58]. We have already met this idea in the Bayesian decision
rule of Example 3.3 which has led to the Bayesian estimator in Definition 3.6.


Negative-Binomial Model 


In the case of the Poisson model, the gamma distribution is a particularly
attractive mixing distribution for $\lambda$ because it allows for a closed-form
solution in (5.36), and $f_{\pi=\Gamma}(y; v)$ will be a negative-binomial
distribution. One can choose different parametrizations of this mixing
distribution, and they will provide different scalings in the resulting
negative-binomial distribution. We choose the following parametrization
$\pi(\lambda) \overset{(d)}{=} \Gamma(v\alpha, v\alpha/\mu)$ for mean parameter
$\mu > 0$ and shape parameter $v\alpha > 0$. This implies, see (5.36),

$$
\begin{aligned}
f_{\rm NB}(y; \mu, v, \alpha)
&= \int_{\mathbb{R}_+} e^{-v\lambda}\, \frac{(v\lambda)^{vy}}{(vy)!}\;
   \frac{(v\alpha/\mu)^{v\alpha}}{\Gamma(v\alpha)}\, \lambda^{v\alpha-1} e^{-v\alpha\lambda/\mu}\, d\lambda \\
&= \frac{\Gamma(vy + v\alpha)}{(vy)!\, \Gamma(v\alpha)}\;
   \frac{v^{vy}\, (v\alpha/\mu)^{v\alpha}}{(v + v\alpha/\mu)^{vy + v\alpha}} \\
&= \binom{vy + v\alpha - 1}{vy} \left(e^{\theta}\right)^{vy} \left(1 - e^{\theta}\right)^{v\alpha},
\end{aligned}
$$

setting for canonical parameter $\theta = \log(\mu/(\mu + \alpha)) < 0$. This is
the negative-binomial distribution we have already met in (2.5). A
single-parameter linear EDF representation is given by, we set unit dispersion
parameter $\varphi = 1$,

$$
Y ~\sim~ f_{\rm NB}(y; \theta, v, \alpha)
= \exp\left\{ \frac{y\theta + \alpha \log(1 - e^{\theta})}{1/v}
+ \log \binom{vy + v\alpha - 1}{vy} \right\},
\tag{5.38}
$$

where this is a density w.r.t. the counting measure on $\mathbb{N}_0/v$. The
cumulant function and the canonical link, respectively, are given by

$$
\kappa(\theta) = -\alpha \log(1 - e^{\theta})
\qquad \text{and} \qquad
\theta = h(\mu) = \log\left(\frac{\mu}{\mu + \alpha}\right) \in \Theta = (-\infty, 0).
$$

4 The gamma distribution is the conjugate prior to the Poisson distribution. As a
result, the posterior distribution, given observations, will again be a gamma
distribution with posterior parameters, see Section 8.1 of Wüthrich [387]. This
Bayesian model has been introduced to the actuarial literature by Bichsel [32].


Note that $\alpha > 0$ is treated as nuisance parameter (which is a fixed part
of the cumulant function, here). The first two moments of the claim count
$N = vY$ are given by

$$
v\mu = \mathbb{E}_\theta[N] = v\alpha\, \frac{e^{\theta}}{1 - e^{\theta}},
\tag{5.39}
$$

$$
{\rm Var}_\theta(N) = v\mu \left(1 + \frac{e^{\theta}}{1 - e^{\theta}}\right)
= \mathbb{E}_\theta[N] \left(1 + \frac{\mu}{\alpha}\right) > \mathbb{E}_\theta[N].
\tag{5.40}
$$

This shows that we receive a fixed over-dispersion of size $\mu/\alpha$, which
(in this parametrization) does not depend on the exposure $v$; this is the
reason for choosing a mixing distribution
$\pi(\lambda) \overset{(d)}{=} \Gamma(v\alpha, v\alpha/\mu)$. This
parametrization is called NB2 parametrization.
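The moments (5.39) and (5.40) can be checked numerically from the probability weights of the claim count $N = vY$. The Python sketch below evaluates the negative-binomial weights on the log scale with the log-gamma function (the generalized binomial coefficient has a non-integer argument $v\alpha$) and compares the resulting mean and variance with $v\mu$ and $v\mu(1 + \mu/\alpha)$. The parameter values are hypothetical.

```python
import math

def nb2_log_pmf(n, mu, v, alpha):
    """Log-probability of N = n claims in the NB2 parametrization:
    mean v * mu, gamma shape v * alpha, canonical parameter log(mu / (mu + alpha))."""
    shape = v * alpha
    theta = math.log(mu / (mu + alpha))                 # negative by construction
    log_binom = math.lgamma(n + shape) - math.lgamma(n + 1) - math.lgamma(shape)
    return log_binom + n * theta + shape * math.log1p(-math.exp(theta))

mu, v, alpha = 0.1, 0.5, 1.8                            # hypothetical values
probs = [math.exp(nb2_log_pmf(n, mu, v, alpha)) for n in range(200)]

total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
var = sum((n - mean) ** 2 * p for n, p in enumerate(probs))

print(f"total probability: {total:.12f}")
print(f"mean: {mean:.6f}  vs  v*mu = {v * mu:.6f}")
print(f"variance: {var:.6f}  vs  v*mu*(1 + mu/alpha) = {v * mu * (1 + mu / alpha):.6f}")
```

The truncation at 200 claims is harmless here because the weights decay geometrically.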


Remarks 5.21 


• We emphasize that the effective domain $\Theta = (-\infty, 0)$ is one-sided
  bounded. Therefore, the canonical link for the linear predictor will not work
  in general because the linear predictor $\boldsymbol{x} \mapsto \eta(\boldsymbol{x})$
  can be both-sided unbounded in a GLM setting. Instead, we use the log-link for
  $g(\cdot)$ in our example below, with the downside that one loses the balance
  property.

• The unit deviance in this negative-binomial EDF model is given by

  $$
  \mathfrak{d}(y, \mu) = 2 \left[ y \log\left(\frac{y}{\mu}\right)
  - (y + \alpha) \log\left(\frac{y + \alpha}{\mu + \alpha}\right) \right],
  $$

  we also refer to Table 4.1 for $\alpha = 1$. We emphasize that this is the unit
  deviance in a single-parameter linear EDF, and we only aim at estimating
  canonical parameter $\theta \in \Theta$ and mean parameter $\mu \in \mathcal{M}$,
  respectively, whereas $\alpha > 0$ is treated as a given nuisance parameter.
  This is important because the unit deviance relies on the saturated model
  which, in general, estimates a one-dimensional parameter $\theta$ and $\mu$,
  respectively, from the one-dimensional observation $Y$. The nuisance parameter
  is not affected by the consideration of the saturated model, and it is treated
  as a fixed part of the cumulant function, which is not estimated at this stage.
  An important consequence of this is that model comparison using deviance
  residuals only works for identical nuisance parameters.

• We mention that we receive over-dispersion in (5.40) though we have dispersion
  parameter $\varphi = 1$ in (5.38). Alternatively, we could do the duality
  transformation $y \mapsto \widetilde{y} = y/\alpha$ for nuisance parameter
  $\alpha > 0$; this gives the reproductive form of the negative-binomial model
  NB2, see also Remarks 2.13. This provides us with a density on
  $\mathbb{N}_0/(v\alpha)$, set $\varphi = 1/\alpha$,

  $$
  \widetilde{Y} ~\sim~ f_{\rm NB}(\widetilde{y}; \theta, v/\varphi)
  = \exp\left\{ \frac{\widetilde{y}\theta + \log(1 - e^{\theta})}{1/(v\alpha)}
  + \log \binom{v\alpha\widetilde{y} + v\alpha - 1}{v\alpha\widetilde{y}} \right\}.
  $$

  The cumulant function and the canonical link, respectively, are now given by

  $$
  \kappa(\theta) = -\log(1 - e^{\theta})
  \qquad \text{and} \qquad
  \theta = h(\widetilde{\mu}) = \log\left(\frac{\widetilde{\mu}}{\widetilde{\mu} + 1}\right)
  \in \Theta = (-\infty, 0).
  $$

  The first two moments are for $\theta \in \Theta$ given by

  $$
  \widetilde{\mu} = \mathbb{E}_\theta[\widetilde{Y}] = \frac{e^{\theta}}{1 - e^{\theta}},
  \qquad
  {\rm Var}_\theta(\widetilde{Y}) = \frac{\varphi}{v}\, \kappa''(\theta)
  = \frac{1}{v\alpha}\, \widetilde{\mu}(1 + \widetilde{\mu}).
  $$

  Thus, we receive the reproductive EDF representation with dispersion parameter
  $\varphi = 1/\alpha$ and variance function
  $V(\widetilde{\mu}) = \widetilde{\mu}(1 + \widetilde{\mu})$. Moreover,
  $N = vY = v\alpha\widetilde{Y}$.

• The negative-binomial model with the NB1 parametrization uses the mixing
  distribution $\pi(\lambda) \overset{(d)}{=} \Gamma(\mu v/\alpha, v/\alpha)$.
  This leads to mean $\mathbb{E}_\theta[N] = v\mu$ and variance
  ${\rm Var}_\theta(N) = \mathbb{E}_\theta[N](1 + \alpha)$. In this
  parametrization, $\mu$ enters the gamma function as $\Gamma(\mu v/\alpha)$ in
  the gamma density which does not allow for an EDF representation. This
  parametrization has been called NB1 by Cameron–Trivedi [63] because both terms
  in the variance ${\rm Var}_\theta(N) = v\mu + v\mu\alpha$ are linear in $\mu$.
  In contrast, in the NB2 parametrization the second term $v\mu^2/\alpha$ is
  quadratic in $\mu$, see (5.40). Further discussion is provided in Greene [171].
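The negative-binomial unit deviance of these remarks is straightforward to code once the boundary case $y = 0$ is handled with the usual convention $0 \log 0 = 0$. The Python sketch below is a minimal implementation for a fixed nuisance parameter; the function name and the numerical values are ours. It also illustrates that for a very large nuisance parameter the value approaches the Poisson unit deviance $2[y\log(y/\mu) - (y - \mu)]$.

```python
import math

def nb_unit_deviance(y, mu, alpha):
    """Unit deviance of the negative-binomial EDF for a fixed nuisance parameter alpha."""
    first = y * math.log(y / mu) if y > 0 else 0.0      # convention: 0 * log(0) = 0
    second = (y + alpha) * math.log((y + alpha) / (mu + alpha))
    return 2.0 * (first - second)

print(nb_unit_deviance(2.0, 2.0, 1.8))       # zero if observation and mean coincide
print(nb_unit_deviance(0.0, 0.1, 1.8))       # strictly positive otherwise

# Poisson limit for a very large nuisance parameter (hypothetical value 1e6).
poisson_deviance = 2.0 * (1.0 * math.log(1.0 / 0.5) - (1.0 - 0.5))
print(nb_unit_deviance(1.0, 0.5, 1e6), poisson_deviance)
```

Because the nuisance parameter enters the formula directly, deviance losses computed with different values of it are on different scales and should not be compared with each other.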


Nuisance Parameter Estimation 


All previous statements have been based on the assumption that $\alpha > 0$ is
a given nuisance parameter. If $\alpha$ needs to be estimated too, then we drop
out of the EF. In this case, an iterative estimation procedure is applied to the
EDF representation (5.38). One starts with a fixed nuisance parameter
$\alpha^{(0)}$ and fits the negative-binomial GLM with MLE, which provides a
first set of MLE
$\widehat{\boldsymbol{\beta}}^{(1)} = \widehat{\boldsymbol{\beta}}^{(1)}(\alpha^{(0)})$.
Based on this estimate the nuisance parameter is updated
$\alpha^{(0)} \mapsto \alpha^{(1)}$ by maximizing the log-likelihood in $\alpha$
for given $\widehat{\boldsymbol{\beta}}^{(1)}$. Iteration of this procedure then
leads to a joint estimation of regression parameter $\boldsymbol{\beta}$ and
nuisance parameter $\alpha$. Both MLE steps in this algorithm increase the joint
log-likelihood.
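The alternating procedure can be sketched in Python for the simplest case of a null model (intercept only), on simulated data. Several simplifications are assumptions of this sketch: in the null model the mean step has the closed form total claims over total exposure and does not depend on the nuisance parameter, so the loop stabilizes immediately, whereas in a GLM this step would be a full refit; the one-dimensional maximization uses a golden-section search on the log scale; all function names are ours.

```python
import math
import random

def nb_loglik(counts, exposures, mu, alpha):
    """Negative-binomial log-likelihood of claim counts (NB2 with exposures)."""
    theta = math.log(mu / (mu + alpha))
    log1m = math.log(alpha / (mu + alpha))              # log(1 - e^theta)
    return sum(math.lgamma(n + v * alpha) - math.lgamma(n + 1) - math.lgamma(v * alpha)
               + n * theta + v * alpha * log1m
               for n, v in zip(counts, exposures))

def maximize_golden(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2.0

def fit_nb_null(counts, exposures, n_iter=2):
    """Alternate between the mean step and the nuisance parameter step."""
    mu, alpha = None, 1.0                               # alpha^(0) = 1
    for _ in range(n_iter):
        # Step 1: MLE of the mean for the current alpha (closed form in the null model).
        mu = sum(counts) / sum(exposures)
        # Step 2: maximize the log-likelihood in alpha for the current mean.
        log_alpha = maximize_golden(
            lambda t: nb_loglik(counts, exposures, mu, math.exp(t)),
            math.log(1e-2), math.log(1e2))
        alpha = math.exp(log_alpha)
    return mu, alpha

def simulate(n_policies, mu, alpha, seed=0):
    """Simulate gamma-mixed Poisson claim counts with exposures 0.5 or 1."""
    rng = random.Random(seed)
    counts, exposures = [], []
    for _ in range(n_policies):
        v = rng.choice([0.5, 1.0])
        lam = rng.gammavariate(v * alpha, mu / (v * alpha))   # shape, scale
        # Poisson(v * lam): count unit-rate exponential arrivals before time v * lam.
        n, t = 0, rng.expovariate(1.0)
        while t < v * lam:
            n += 1
            t += rng.expovariate(1.0)
        counts.append(n)
        exposures.append(v)
    return counts, exposures

counts, exposures = simulate(3000, mu=2.0, alpha=1.5, seed=42)
mu_hat, alpha_hat = fit_nb_null(counts, exposures)
print(f"mu_hat = {mu_hat:.3f}, alpha_hat = {alpha_hat:.3f}")
```

The search interval for the nuisance parameter is an arbitrary choice of this sketch; an estimate at its upper end would indicate that the data show no over-dispersion.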


Remark 5.22 (Implementation of the Negative-Binomial GLM in R) Implementation
of the negative-binomial model needs some care. There are two R procedures
glm and glm.nb that can be used to fit negative-binomial GLMs, the latter being
built on the former. The procedure glm is just the classical R procedure [307]
that is usually used to fit GLMs within the EDF; it requires setting

    family=negative.binomial(theta, link="log").

This parametrization considers the single-parameter linear EF on $N$ (for mean
$\mu \in \mathcal{M}$)

$$
f_{\rm NB}(n; \mu, \mathtt{theta}) = \binom{n + \mathtt{theta} - 1}{n}
\left(\frac{\mu}{\mu + \mathtt{theta}}\right)^{n}
\left(1 - \frac{\mu}{\mu + \mathtt{theta}}\right)^{\mathtt{theta}},
$$

where theta > 0 denotes the nuisance parameter. The tricky part now is that we
have to bring in the different exposures $v_i$ of all policies $1 \le i \le n$.
That is, we would like to have for claim counts $n_i = v_i y_i$, see (5.38),

$$
\begin{aligned}
f_{\rm NB}(y_i; \mu_i, v_i, \alpha)
&= \binom{v_i y_i + v_i\alpha - 1}{v_i y_i}
\left(\frac{v_i\mu_i}{v_i\mu_i + v_i\alpha}\right)^{v_i y_i}
\left(1 - \frac{v_i\mu_i}{v_i\mu_i + v_i\alpha}\right)^{v_i\alpha} \\
&\propto \left[ \left(\frac{\mu_i}{\mu_i + \alpha}\right)^{y_i}
\left(1 - \frac{\mu_i}{\mu_i + \alpha}\right)^{\alpha} \right]^{v_i}.
\end{aligned}
$$

The square bracket can be implemented in glm as a scaled and weighted regression
problem, see Listing 5.8 with theta = $\alpha$. This approach provides the
correct GLM parameter estimates $\widehat{\boldsymbol{\beta}}^{\rm MLE}$ for
given $\alpha$; however, the AIC values in the output cannot be compared to the
Poisson case. Note that the Poisson case of Table 5.5 considers observations
$N_i$ whereas Listing 5.8 uses $Y_i = N_i/v_i$. For this reason we calculate
the log-likelihood and AIC with our own implementation.

The same remark applies to glm.nb, and also nuisance parameter estimation
cannot be performed by that routine under different exposures $v_i$. Therefore,
we have implemented an iterative estimation algorithm ourselves, alternating glm
of Listing 5.8 for given $\alpha$ and a maximization routine optimize to find
the optimal $\alpha$ for given $\boldsymbol{\beta}$ using (5.38). We have
applied this iteration in Example 5.23, below, and it has converged in 5
iterations.
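The proportionality used in this remark can be verified numerically: the full log-density and the exposure-weighted logarithm of the square bracket differ only by the log-binomial term, which does not depend on the mean. The following Python sketch checks this for one hypothetical policy; the function names are ours and the sketch makes no statement about the internals of the R routines. It explains why the weighted fit gives the right parameter estimates for a given nuisance parameter while the log-likelihood value, and hence AIC, has to be computed separately.

```python
import math

def full_log_density(y, mu, v, alpha):
    """Log of the negative-binomial density (5.38), expressed in the mean mu."""
    n = v * y                                            # claim count
    p = mu / (mu + alpha)                                # e^theta
    return (math.lgamma(n + v * alpha) - math.lgamma(n + 1) - math.lgamma(v * alpha)
            + n * math.log(p) + v * alpha * math.log(1.0 - p))

def weighted_kernel(y, mu, v, alpha):
    """Exposure v times the log of the square bracket (the part depending on mu)."""
    p = mu / (mu + alpha)
    return v * (y * math.log(p) + alpha * math.log(1.0 - p))

# Hypothetical policy: one claim in half a year, i.e. y = N / v = 2.
y, v, alpha = 2.0, 0.5, 1.8
offsets = [full_log_density(y, mu, v, alpha) - weighted_kernel(y, mu, v, alpha)
           for mu in (0.05, 0.1, 0.5)]
print(offsets)
```

The printed differences coincide for all three means, and the common value is the log-binomial coefficient for this observation.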


Example 5.23 (Negative-Binomial Distribution for Claim Counts) We revisit the 
MTPL claim frequency GLM example of Sect. 5.3.4, but we replace the Poisson 
distribution by the negative-binomial one. We start with the negative-binomial (NB) 
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Listing 5.8 Implementation of model NB GLM3 


d.glmnb <- glm(ClaimNb/Exposure ~ VehPowerGLM + VehAgeGLM 
+ log(DrivAge) + I(DrivAge*3) + I(DrivAge*4) 
+ BonusMalusGLMs*DrivAge + BonusMalusGLM«I (DrivAge*2) 
+ VehBrand + VehGas + DensityGLM + Region + AreaGLM, 
data=learn, weights=Exposure, 
family=negative.binomial (alpha, link="log") ) 


Table 5.7 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses 
(units are in 107?) and in-sample average frequency of the null models (Poisson and negative- 
binomial) and the Poisson and negative-binomial GLMs. The optimal model is highlighted in 
boldface 


Run | # In-sample | Out-of-sample | Aver. 

time | Param. | AIC loss on £ | loss on T freq. 
Poisson null - 1 199°506 | 25.213 25.445 7.36% 
Poisson GLM3 15s |50 192°716 |24.084 |24.102 7.36% 
NB null @MIE = 1.059 |- 2 198°466 | 20.357 | 20.489 | 7.36% 
NB null@Ny” = 1.810 |- 1 198°564 | 21.796 | 21.948 7.36% 
NB GLM3 @MLE = 1.810 | 85s | 51 1927113 | 20.722 _| 20.674 7.38% 


ANB 


null model. The NB null model has two parameters, the homogeneous (overall)
frequency and the nuisance parameter. MLE of the homogeneous overall frequency
is identical to the one in the Poisson null model, and MLE of the nuisance parameter
provides $\hat\alpha^{MLE}_{null} = 1.059$. This is substantially smaller than infinity and suggests
over-dispersion. The results are presented on the third line of Table 5.7. We observe
a smaller AIC of the NB null model against the Poisson null model, which says that
we should allow for over-dispersion.

We now focus on the NB GLM. The feature pre-processing is done exactly as
in model Poisson GLM3, and we choose the log-link for $g$. We call this model
NB GLM3. The iterative estimation procedure outlined above provides a nuisance
parameter estimate $\hat\alpha^{MLE}_{NB} = 1.810$. This is bigger than in the NB null model because
the regression structure explains some part of the over-dispersion. However, it is
still substantially smaller than infinity, which justifies the inclusion of this over-
dispersion parameter.

The last line of Table 5.7 gives the result of model NB GLM3. From AIC we
conclude that we favor the negative-binomial GLM over the Poisson GLM since
AIC decreases from 192'716 to 192'113. The in-sample and out-of-sample deviance
losses can only be compared within the same models, i.e., the models that have the
same cumulant function. This also applies to the negative-binomial models which
have cumulant function $\kappa(\theta) = -\alpha \log(1 - e^{\theta})$. Thus, to compare the NB null
model and model NB GLM3, we need to choose the same nuisance parameter $\alpha$.
For this reason we added this second NB null model to Table 5.7. This second NB
null model no longer uses the MLE $\hat\alpha^{MLE}_{null}$; therefore, the corresponding AIC only
includes one estimated parameter.

Fig. 5.7 Poisson logged predictors vs. negative-binomial logged predictors
[Scatter plot "Poisson vs. NB linear predictors"; x-axis: Poisson linear predictors, y-axis: negative binomial linear predictors]


Table 5.8 Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted
in boldface

Model                                     | Poisson deviance | NB deviance $\hat\alpha^{MLE}_{null} = 1.059$ | NB deviance $\hat\alpha^{MLE}_{NB} = 1.810$
Null model                                | 25.445 | 20.489 | 21.948
Poisson GLM3                              | 24.102 | 19.266 | 20.678
NB GLM3 $\hat\alpha^{MLE}_{NB} = 1.810$   | 24.100 | 19.262 | 20.674


As mentioned above, deviance losses can only be compared under exactly the
same cumulant function (including the same nuisance parameters). If we want to
have a more robust model selection, we can consider forecast dominance according
to Definition 4.20. Being less ambitious, here, we consider forecast dominance
only for the three considered cumulant functions Poisson, negative-binomial with
$\hat\alpha^{MLE}_{null} = 1.059$ and negative-binomial with $\hat\alpha^{MLE}_{NB} = 1.810$. The out-of-sample
deviance losses are given in Table 5.8 in the different columns. According to this
forecast dominance analysis we also give preference to model NB GLM3, but model
Poisson GLM3 is pretty close.

Figure 5.7 compares the logged predictors $\log(\hat\mu_i)$, $1 \le i \le n$, of the models
Poisson GLM3 and NB GLM3. We see a huge similarity in these predictors; only
high frequency policies are judged slightly differently by the NB model compared
to the Poisson model.

Table 5.9 gives the predicted number of claims against the observed ones. We
observe that model NB GLM3 predicts more accurately the number of policies with
2 or fewer claims, but it over-estimates the number of policies with more than 2 claims.
This may also be related to the fact that the estimated in-sample frequency has a
positive bias in model NB GLM3, see Table 5.7. That is, since we do not work with
the canonical link, we do not have the balance property.

Table 5.9 Contingency table of observed number of policies against predicted number of policies
with given claim counts ClaimNb

Numbers of claims ClaimNb              | 0       | 1      | 2     | 3   | 4  | 5
Observed number of policies            | 587'772 | 21'198 | 1'174 | 57  | 4  | 1
Poisson predicted number of policies   | 587'325 | 22'064 | 779   | 34  | 3  | 0.3
NB predicted number of policies        | 587'902 | 20'982 | 1'200 | 100 | 15 | 4


Listing 5.9 drop1 analysis of model NB GLM3

Single term deletions

Model:
ClaimNb/Exposure ~ VehPowerGLM + VehAgeGLM + DrivAge + log(DrivAge) +
    I(DrivAge^2) + I(DrivAge^3) + I(DrivAge^4) + BonusMalusGLM *
    DrivAge + BonusMalusGLM * I(DrivAge^2) + BonusMalusGLM +
    VehBrand + VehGas + DensityGLM + Region + AreaGLM
                            Df Deviance    AIC scaled dev.  Pr(>Chi)
<none>                          126446 171064
VehPowerGLM                  5  126524 171102      48.266 3.134e-09 ***
VehAgeGLM                    2  126655 171190     130.070 < 2.2e-16 ***
log(DrivAge)                 1  126592 171153      91.057 < 2.2e-16 ***
I(DrivAge^3)                 1  126527 171112      50.483 1.202e-12 ***
I(DrivAge^4)                 1  126508 171100      38.381 5.820e-10 ***
VehBrand                    10  126658 171176     132.098 < 2.2e-16 ***
VehGas                       1  126583 171147      85.232 < 2.2e-16 ***
DensityGLM                   1  126456 171068       6.137   0.01324 *
Region                      21  126622 171132     109.838 5.042e-14 ***
AreaGLM                      1  126450 171064       2.411   0.12049
DrivAge:BonusMalusGLM        1  126484 171085      23.481 1.262e-06 ***
I(DrivAge^2):BonusMalusGLM   1  126490 171089      27.199 1.836e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We close this example by providing the drop1 analysis in Listing 5.9. From
this analysis we conclude that the feature component Area should be dropped.
Of course, this confirms the high collinearity between Density and Area which
implies that we do not need both variables in the model. We remark that the AIC
values in Listing 5.9 are not on our scale, as stated in Remark 5.22. ■


5.3.6 Zero-Inflated Poisson Model 


In many applications the Poisson distribution does not fully fit
the claim counts data because there are too many policies with zero claims, i.e.,
policies with $Y = 0$, compared to a Poisson assumption. This topic has attracted
some attention in the recent actuarial literature, see, e.g., Boucher et al. [43-45],
Frees et al. [137], Calderin-Ojeda et al. [62] and Lee [239]. An obvious solution to
this problem is to 'artificially' increase the probability of a zero claim compared to
a Poisson model; this is the proposal introduced by Lambert [232]. $Y$ has a zero-
inflated Poisson (ZIP) distribution if the probability weights of $Y$ are given by (set
$v = 1$)


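$$
f_{\rm ZIP}(y; \theta, \pi_0) = \begin{cases} \pi_0 + (1-\pi_0)\, e^{-\mu} & \text{for } y = 0, \\[1mm] (1-\pi_0)\, e^{-\mu} \dfrac{\mu^y}{y!} & \text{for } y \in \mathbb{N}, \end{cases}
$$

for $\pi_0 \in (0,1)$, $\mu = e^{\theta} > 0$, and for the Poisson probability weights we refer
to (2.4). For $\pi_0 > 0$ the weight of a zero claim $Y = 0$ is increased (inflated)
compared to the original Poisson distribution.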
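
The ZIP probability weights can be written down directly in R. The following is a minimal sketch for unit exposure; the function name dzip and its arguments are hypothetical helpers introduced here for illustration and do not come from a package.

```r
# ZIP probability weights for unit exposure v = 1 (illustrative helper)
dzip <- function(y, mu, pi0) {
  stopifnot(pi0 > 0, pi0 < 1, mu > 0)
  # extra mass pi0 sits only at y = 0; Poisson weights are scaled by (1 - pi0)
  pi0 * (y == 0) + (1 - pi0) * dpois(y, lambda = mu)
}

p_zip <- dzip(0:5, mu = 0.1, pi0 = 0.3)   # inflated zero weight
p_poi <- dpois(0:5, lambda = 0.1)         # ordinary Poisson weights
cbind(y = 0:5, p_zip, p_poi)
```

Comparing the two columns shows the shift of probability mass towards zero; the weights for $y \ge 1$ are the Poisson weights multiplied by $1-\pi_0$.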


Remarks 5.24

• The ZIP distribution has different interpretations. It can be interpreted as a
  hierarchical model where we have a latent variable $Z$ which indicates with
  probability $\pi_0$ that we have an excess zero, and with probability $1 - \pi_0$ we have
  an ordinary Poisson distribution, i.e., for $y \in \mathbb{N}_0$

$$
\mathbb{P}_\theta[Y = y \,|\, Z = z] = \begin{cases} \mathbb{1}_{\{y=0\}} & \text{for } z = 0, \\[1mm] e^{-\mu} \dfrac{\mu^y}{y!} & \text{for } z = 1, \end{cases} \tag{5.41}
$$

  with $\mathbb{P}[Z = 0] = 1 - \mathbb{P}[Z = 1] = \pi_0$.

  The latter shows that we can also understand it as a mixture of two distribu-
  tions, namely, of the Poisson distribution and of a single point measure in $y = 0$
  with mixing probability $\pi_0$. Mixture distributions are going to be discussed in
  Sect. 6.3.1, below. In this sense, we can also interpret the model as a mixed
  Poisson model with mixing distribution $\pi(\lambda)$ being a Bernoulli distribution
  taking values 0 and $\mu$ with probability $\pi_0$ and $1 - \pi_0$, respectively, see (5.36),
  and the former parameter $\lambda = 0$ leads to a degenerate Poisson model.

• We have introduced the ZIP model, but this approach is neither limited to the
  Poisson model nor the zeros. For instance, we could also consider an inflated
  negative-binomial model where both the zeros and the ones are inflated with
  probabilities $\pi_0, \pi_1 > 0$ such that $\pi_0 + \pi_1 < 1$.

• Hurdle models are an alternative way to model excess zeros. Hurdle models
  have been introduced by Cragg [83], and they also allow for too few zeros.
  A hurdle (Poisson) model mixes a lower-truncated (Poisson) count distribution
  with a point mass in zero

$$
f_{\rm hurdle\ Poisson}(y; \theta, \pi_0) = \begin{cases} \pi_0 & \text{for } y = 0, \\[1mm] (1-\pi_0)\, \dfrac{e^{-\mu} \mu^y / y!}{1 - e^{-\mu}} & \text{for } y \in \mathbb{N}, \end{cases} \tag{5.42}
$$

  for $\pi_0 \in (0, 1)$ and $\mu > 0$. For $\pi_0 > e^{-\mu}$ the weight of a zero claim is increased
  and for $\pi_0 < e^{-\mu}$ it is decreased. This distribution is called a hurdle distribution,
  because we first need to overcome the hurdle at zero to come to the Poisson
  model. Lower-truncated distributions are studied in Sect. 6.4, below, and mixture
  distributions are discussed in Sect. 6.3.1. In general, fitting lower-truncated
  distributions is challenging because the density and the distribution function
  should both have tractable forms to perform MLE for truncated distributions.
  The Expectation-Maximization (EM) algorithm is a useful tool to perform
  model fitting under truncation. We come back to the hurdle Poisson model in
  Example 6.19, below, and it is also closely related to the zero-truncated Poisson
  (ZTP) model discussed in Remarks 6.20.
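
The hurdle Poisson weights (5.42) can be coded in the same spirit. This is a small sketch with a hypothetical helper dhurdle (not a package function); it also checks numerically that the choice $\pi_0 = e^{-\mu}$ gives back the ordinary Poisson weights, which is the boundary between inflating and deflating the zeros.

```r
# Hurdle Poisson probability weights, see (5.42); illustrative helper
dhurdle <- function(y, mu, pi0) {
  stopifnot(pi0 > 0, pi0 < 1, mu > 0)
  # zero-truncated Poisson weights on y >= 1
  ztp <- ifelse(y >= 1, dpois(y, lambda = mu) / (1 - exp(-mu)), 0)
  ifelse(y == 0, pi0, (1 - pi0) * ztp)
}

mu <- 0.5
# pi0 = exp(-mu): the hurdle model coincides with the Poisson model
gap <- max(abs(dhurdle(0:10, mu, exp(-mu)) - dpois(0:10, lambda = mu)))
# pi0 below exp(-mu): fewer zeros than under the Poisson model
p0_deflated <- dhurdle(0, mu, pi0 = 0.4)
```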


The first two moments of a ZIP random variable $Y \sim f_{\rm ZIP}(\cdot; \theta, \pi_0)$ are given by

$$
\mathbb{E}_{\theta,\pi_0}[Y] = (1 - \pi_0)\mu,
$$

$$
{\rm Var}_{\theta,\pi_0}(Y) = (1 - \pi_0)\mu + (\pi_0 - \pi_0^2)\mu^2 = \mathbb{E}_{\theta,\pi_0}[Y]\,(1 + \pi_0 \mu);
$$

these calculations easily follow with the latent variable $Z$ interpretation from above.
As a consequence, we receive an over-dispersed model with over-dispersion $\pi_0 \mu$
(the latter also follows from the fact that we consider a mixed Poisson distribution
with a Bernoulli mixing distribution having weights $\pi_0$ in 0 and $1 - \pi_0$ in $\mu > 0$,
see (5.37)).
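
For completeness, the moment calculation via the latent variable $Z$ of (5.41) can be spelled out as follows; this is a short worked derivation using the tower property and the variance decomposition, with $\mathbb{P}[Z=1] = 1-\pi_0$.

```latex
\begin{align*}
\mathbb{E}[Y \mid Z] &= \mu\, Z, \qquad {\rm Var}(Y \mid Z) = \mu\, Z, \\
\mathbb{E}[Y] &= \mathbb{E}\big[\mathbb{E}[Y \mid Z]\big] = \mu\, \mathbb{P}[Z=1] = (1-\pi_0)\mu, \\
{\rm Var}(Y) &= \mathbb{E}\big[{\rm Var}(Y \mid Z)\big] + {\rm Var}\big(\mathbb{E}[Y \mid Z]\big)
 = (1-\pi_0)\mu + \mu^2\, \pi_0 (1-\pi_0) \\
 &= (1-\pi_0)\mu\, (1 + \pi_0 \mu).
\end{align*}
```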

Unfortunately, MLE does not allow for explicit solutions in this model. The score
equations of $Y_1, \ldots, Y_n \overset{\rm i.i.d.}{\sim} f_{\rm ZIP}(\cdot; \theta, \pi_0)$ are given by

$$
\nabla_{(\pi_0,\mu)}\, \ell_Y(\pi_0, \mu) = \nabla_{(\pi_0,\mu)} \sum_{i=1}^n \log\left(\pi_0 + (1-\pi_0) e^{-\mu}\right) \mathbb{1}_{\{Y_i = 0\}}
+ \nabla_{(\pi_0,\mu)} \sum_{i=1}^n \log\left((1-\pi_0)\, e^{-\mu} \frac{\mu^{Y_i}}{Y_i!}\right) \mathbb{1}_{\{Y_i > 0\}} = 0.
$$

The R package pscl [401] has a function called zeroinfl which uses the general
purpose optimizer optim to find the MLEs in the ZIP model. Alternatively, we
could explore the EM algorithm for mixture distributions presented in Sect. 6.3,
below.
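
To illustrate the numerical solution of these score equations without any regression structure, the following sketch maximizes the ZIP log-likelihood with the base R optimizer optim on simulated data. The simulation set-up, the variable names and the reparametrization $(\text{logit}\, \pi_0, \log \mu)$ are choices made here for illustration; they are not the internals of zeroinfl.

```r
set.seed(123)
n <- 5000
mu_true  <- 2
pi0_true <- 0.3
z <- rbinom(n, size = 1, prob = 1 - pi0_true)  # z = 0: excess zero, z = 1: Poisson
y <- z * rpois(n, lambda = mu_true)

# negative ZIP log-likelihood; par = (logit(pi0), log(mu)) is unconstrained
negloglik <- function(par) {
  pi0 <- plogis(par[1])
  mu  <- exp(par[2])
  ll0 <- log(pi0 + (1 - pi0) * exp(-mu))                    # contribution of Y_i = 0
  ll1 <- log(1 - pi0) + dpois(y, lambda = mu, log = TRUE)   # contribution of Y_i > 0
  -sum(ifelse(y == 0, ll0, ll1))
}

fit     <- optim(c(0, 0), negloglik)   # default Nelder-Mead search
pi0_hat <- plogis(fit$par[1])
mu_hat  <- exp(fit$par[2])
```

Working on the logit and log scales keeps $\pi_0 \in (0,1)$ and $\mu > 0$ automatically, so no constrained optimizer is needed.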

In insurance applications, the ZIP application can be problematic if we have
different exposures $v_i > 0$ for different insurance policies $i$. In the Poisson GLM
case with canonical link choice we typically integrate the different exposures into
the offset, see (5.27). However, it is not clear whether and how we should integrate
the different exposures into the zero-inflation probability $\pi_0$. It seems natural to
believe that shorter exposures should increase $\pi_0$, but the explicit functional form of
this increase can be debated; some options are discussed in Section 5 of Lee [239].


Listing 5.10 Implementation of model ZIP GLM3

d.ZIP <- zeroinfl(ClaimNb ~ VehPowerGLM + VehAgeGLM
          + log(DrivAge) + I(DrivAge^3) + I(DrivAge^4)
          + BonusMalusGLM*DrivAge + BonusMalusGLM*I(DrivAge^2)
          + VehBrand + VehGas + DensityGLM + Region
          + AreaGLM | 1,
          data=learn, offset=log(Exposure), dist='poisson', link='logit',
          start=list(count=glm3$coefficients, zero=c(-0.4153)))


Table 5.10 Run times, number of parameters, AICs, in-sample and out-of-sample deviance
losses (units are in $10^{-2}$) and in-sample average frequency of the null models (Poisson, negative-
binomial and ZIP) and the Poisson, negative-binomial and ZIP GLMs. The optimal model is
highlighted in boldface

Model                                           | Run time | # Param. | AIC     | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq.
Poisson null                                    | -        | 1        | 199'506 | 25.213 | 25.445 | 7.36%
Poisson GLM3                                    | 15s      | 50       | 192'716 | 24.084 | 24.102 | 7.36%
NB null $\hat\alpha^{MLE}_{null} = 1.059$       | -        | 2        | 198'466 | 20.357 | 20.489 | 7.36%
NB null $\hat\alpha^{MLE}_{NB} = 1.810$         | -        | 1        | 198'564 | 21.796 | 21.948 | 7.36%
NB GLM3 $\hat\alpha^{MLE}_{NB} = 1.810$         | 85s      | 51       | 192'113 | 20.722 | 20.674 | 7.38%
ZIP null                                        | 20s      | 2        | 198'638 | -      | -      | 7.43%
ZIP GLM3 (null $\pi_0$)                         | 270s     | 51       | 192'393 | -      | -      | 7.37%


In the following application, we simply choose $\pi_0$ independent of the exposures, but
certainly this is not the best modeling choice.


Example 5.25 (ZIP Model for Claim Counts) We revisit the MTPL claim frequency
example of Sect. 5.3.4, but this time we fit a ZIP model. For the Poisson part we
use exactly the same GLM regression function as in model Poisson GLM3 and,
in particular, we use for the different exposures $v_i$ of the insurance policies the
offset term $o_i = \log v_i$, see line 6 of Listing 5.10. This offset only acts on the
Poisson part of the ZIP GLM. The zero-inflating probability $\pi_0$ is modeled with a
logistic Bernoulli model, see Sect. 2.1.2. For computational reasons, we choose the
null model for the Bernoulli part modeling the zero-inflation $\pi_0$. This is indicated
by the "1" on line 5 of Listing 5.10. This 1 should be expanded if we also want to
consider a regression model for the zero-inflating probability $\pi_0$ and, in particular,
if we want to integrate an offset term for the exposure. We can set this term to
offset(f), where f is a suitable transformation of the exposure. Furthermore,
successful calibration requires meaningful starting values, otherwise zeroinfl
will not find the MLEs. We start the algorithm in the parameters of model Poisson
GLM3, see line 7 of Listing 5.10. The results are presented in Table 5.10.

Firstly, we see that the run times are not fully competitive in this implementation,
even if we choose the null model for the zero-inflating probability $\pi_0$, i.e., only
one intercept parameter is involved for determining $\pi_0$. Secondly, in this model we
cannot calculate deviance losses because the saturated model has two parameters for
each observation. Thirdly, the model does not satisfy the balance property even though we
work with the canonical links for the Poisson part and the Bernoulli part;
this property gets lost under the combination of these two model parts.

Table 5.11 Out-of-sample deviance losses: forecast dominance. The optimal model is highlighted
in boldface

Model                                     | Poisson deviance | NB deviance $\hat\alpha^{MLE}_{null} = 1.059$ | NB deviance $\hat\alpha^{MLE}_{NB} = 1.810$
Null model                                | 25.445 | 20.489 | 21.948
Poisson GLM3                              | 24.102 | 19.266 | 20.678
NB GLM3 $\hat\alpha^{MLE}_{NB} = 1.810$   | 24.100 | 19.262 | 20.674
ZIP null model                            | 25.446 | 20.490 | 21.949
ZIP GLM3                                  | 24.103 | 19.267 | 20.679

Table 5.12 Contingency table of observed numbers of policies against predicted numbers of
policies with given claim counts ClaimNb

Numbers of claims ClaimNb              | 0       | 1      | 2     | 3   | 4  | 5
Observed number of policies            | 587'772 | 21'198 | 1'174 | 57  | 4  | 1
Poisson predicted number of policies   | 587'325 | 22'064 | 779   | 34  | 3  | 0.3
NB predicted number of policies        | 587'902 | 20'982 | 1'200 | 100 | 15 | 4
ZIP predicted number of policies       | 587'829 | 21'094 | 1'191 | 79  | 9  | 4

Most interesting are the AIC values. We observe that the ZIP GLM improves the 
Poisson GLM, but it has a bigger AIC value than the negative-binomial GLM. From 
this we conclude that we give preference to the negative-binomial model in our case. 

Considering forecast dominance according to Definition 4.20, but restricted to
the three deviance losses studied in Example 5.23, we receive Table 5.11. This
table also gives preference to the negative-binomial GLM. However, if we consider the
table of the observed numbers of policies against the predicted numbers of policies,
see Table 5.12, we give preference to the ZIP GLM because it has the lowest $\chi^2$-value,
i.e., it reflects best (in-sample) our observations.

Figure 5.8 compares the resulting predictors on the log-scale. From this plot we
conclude that in our example the predictors of the ZIP GLM are closer to the Poisson
ones than the NB GLM predictors. In a next step, one could refine the zero-inflating
probability $\pi_0$ modeling by integrating the exposure and further feature information.
This would lead to a further model improvement. We refrain here from doing so and
close this example; in Example 6.19, below, we study the hurdle Poisson model. ■

Fig. 5.8 Comparison of the linear predictors of the NB and ZIP GLMs against the ones of the
Poisson GLM
[Scatter plot "Poisson vs. NB linear predictors"; y-axis: NB and ZIP linear predictors; legend: NB GLM, ZIP GLM]


5.3.7 Lab: Gamma GLM for Claim Sizes


As a second example we consider claim size modeling within GLMs. For this
example we do not use the French MTPL claims data because the empirical
density plot in Fig. 13.15 indicates that a GLM will not fit that data. The French
MTPL data seems to have three distinct modes, which suggests using a mixture
distribution. Moreover, the log-log plot indicates a regularly varying tail, which
cannot be captured by the EDF on the original observation scale; we are going
to study this data in Example 6.14, below. Here, we use the Swedish motorcycle
data, previously used in the textbook of Ohlsson-Johansson [290] and described in
Chap. 13.2. From Fig. 5.9 we see that the empirical density has one mode, and the
log-log plot supports light tails, i.e., the gamma model might be a suitable choice for
this data. Therefore, we choose a gamma GLM with log-link $g$. As described above,
the log-link is not the canonical link for the gamma EDF distribution but it ensures
the right sign w.r.t. the linear predictor $\eta_i = \langle \beta, x_i \rangle$. Working with the log-link in
the gamma model will imply that the balance property is not fulfilled.


Fig. 5.9 (lhs) Empirical density, (middle) empirical distribution and (rhs) log-log plot of claim
amounts of the Swedish motorcycle data presented in Chap. 13.2
[Panel titles: empirical density of average claim amounts; empirical distribution of average claim amounts; log-log plot of average claim amounts]


Feature Engineering


We have 4 continuous feature components OwnerAge, RiskClass, VehAge and
BonusClass, one binary feature component Gender and a categorical compo-
nent Area, see Listing 13.4. We have opted for minimal feature engineering; we
refer to Figs. 13.19 (rhs) and 13.20 (rhs) for descriptive plots. We use the continuous
variables directly in a log-linear fashion, we add quadratic terms for OwnerAge and
VehAge, we merge RiskClass 6 and 7, and we censor VehAge at 20. Area
is categorical, but we may interpret the Zone levels as ordinal categorical, and
mapping them to integers allows us to use them in a continuous fashion; Fig. 13.19
(middle row, rhs) shows that this is a reasonable choice. Moreover, we merge Zone
5, 6 and 7 due to small volumes and their similar behavior.


Gamma Generalized Linear Model 


The Swedish motorcycle claim amount data poses the special difficulty that we
do not have individual claim observations $Z_{i,j}$, but we only know the total claim
amounts $S_i = \sum_{j=1}^{N_i} Z_{i,j}$ and the number of claims $N_i$ on each insurance policy;
Fig. 5.9 shows average claims $S_i/N_i$ of insurance policies $i$ with $N_i > 0$. In general,
this poses a problem in statistical modeling, but in the gamma model this problem
can be handled because the gamma distribution is closed under aggregation of
i.i.d. gamma claims $Z_{i,j}$. In all that follows in this section, we only study insurance
policies with $N_i > 0$, and we label these insurance policies $i$ accordingly.

Assume that $Z_{i,j}$ are i.i.d. gamma distributed with shape parameter $\alpha_i$ and scale
parameter $c_i$; we refer to (2.6). The mean, the variance and the moment generating
function of $Z_{i,j}$ are given by

$$
\mathbb{E}[Z_{i,j}] = \frac{\alpha_i}{c_i}, \qquad {\rm Var}(Z_{i,j}) = \frac{\alpha_i}{c_i^2} \qquad \text{and} \qquad M_{Z_{i,j}}(r) = \left(\frac{c_i}{c_i - r}\right)^{\alpha_i}, \tag{5.43}
$$

where the moment generating function requires $r < c_i$ to be finite. Assuming that
the number of claims $N_i$ is a known positive integer $n_i \in \mathbb{N}$, we see from the
moment generating function that $S_i = \sum_{j=1}^{n_i} Z_{i,j}$ is again gamma distributed with
shape parameter $n_i \alpha_i$ and scale parameter $c_i$. We change the notation from $N_i$ to
$n_i$ to emphasize that the number of claims is treated as a known constant (and
also to avoid using the notation of conditional probabilities, here). Finally, we scale
$Y_i = S_i/(n_i \alpha_i) \sim \Gamma(n_i \alpha_i, n_i \alpha_i c_i)$. This random variable $Y_i$ has a single-parameter
EDF gamma distribution with weight $v_i = n_i$, dispersion $\varphi_i = 1/\alpha_i$ and cumulant
function $\kappa(\theta_i) = -\log(-\theta_i)$, for $\theta_i \in \Theta = (-\infty, 0)$,

$$
Y_i \sim f(y; \theta_i, v_i/\varphi_i) = \exp\left\{ \frac{y \theta_i - \kappa(\theta_i)}{\varphi_i / v_i} + a(y; v_i/\varphi_i) \right\}
= \frac{(-\theta_i \alpha_i v_i)^{v_i \alpha_i}}{\Gamma(v_i \alpha_i)}\, y^{v_i \alpha_i - 1} \exp\left\{ -(-\theta_i \alpha_i v_i)\, y \right\}, \tag{5.44}
$$

and the canonical parameter is $\theta_i = -c_i$. For our GLM analysis we treat the shape
parameter $\alpha_i \equiv \alpha > 0$ as a nuisance parameter that does not depend on the specific
policy $i$, i.e., we set constant dispersion $\varphi = 1/\alpha$, and only the scale parameter $c_i$ is
chosen policy dependent through $\theta_i = -c_i$.
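
The aggregation and scaling steps behind (5.44) can be made explicit with the moment generating function (5.43). The following short derivation only uses the independence of the claims and the scaling rule $M_{S/k}(r) = M_S(r/k)$ for a constant $k > 0$.

```latex
\begin{align*}
M_{S_i}(r) &= \prod_{j=1}^{n_i} M_{Z_{i,j}}(r) = \left(\frac{c_i}{c_i - r}\right)^{n_i \alpha_i}
 \quad \Longrightarrow \quad S_i \sim \Gamma(n_i \alpha_i,\, c_i), \\
M_{S_i/k}(r) &= M_{S_i}(r/k) = \left(\frac{k c_i}{k c_i - r}\right)^{n_i \alpha_i}
 \quad \Longrightarrow \quad S_i/k \sim \Gamma(n_i \alpha_i,\, k c_i), \\
k = n_i \alpha_i: \quad
\mathbb{E}[Y_i] &= \frac{n_i \alpha_i}{n_i \alpha_i c_i} = \frac{1}{c_i} = -\frac{1}{\theta_i} = \kappa'(\theta_i), \qquad
{\rm Var}(Y_i) = \frac{n_i \alpha_i}{(n_i \alpha_i c_i)^2} = \frac{\varphi_i}{v_i}\, \frac{1}{c_i^2} = \frac{\varphi_i}{v_i}\, \kappa''(\theta_i).
\end{align*}
```

The last line shows the quadratic variance function $V(\mu) = \mu^2$ of the gamma model, with the number of claims $n_i$ acting as weight $v_i$.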

Random variable $Y_i = S_i/(n_i \alpha) \sim \Gamma(n_i \alpha, n_i \alpha c_i)$ gives the reproductive form
of the gamma EDF, see Remarks 2.13. In applications, this form is not directly
useful because under unknown shape parameter $\alpha$, we cannot calculate observations
$Y_i = S_i/(n_i \alpha)$. For this reason, we parametrize the model differently, here. We
consider instead

$$
Y_i = S_i/n_i \sim \Gamma(n_i \alpha, n_i c_i). \tag{5.45}
$$

This (new) random variable has the same gamma EDF (5.44), we only need to
reinterpret the canonical parameter as $\theta_i = -c_i/\alpha$. Then, we choose the log-link
for $g$ which implies

$$
\mu_i = \mathbb{E}_{\theta_i}[Y_i] = \kappa'(\theta_i) = -\frac{1}{\theta_i} = \exp\{\eta_i\} = \exp\langle \beta, x_i \rangle,
$$

if $x_i \in \mathcal{X} \subset \mathbb{R}^{q+1}$ describes the pre-processed features of policy $i$. The gamma
GLM is now fully specified and can be fitted to the data; from Example 5.5 we
know that we have a concave maximization problem. We call this model Gamma
GLM1 (with the feature pre-processing as described above). Note that the (constant)
dispersion parameter $\varphi$ cancels in the score equations, thus, we do not need to
explicitly specify the nuisance parameter $\alpha$ to estimate regression parameter $\beta \in \mathbb{R}^{q+1}$.


Maximum Likelihood Estimation and Model Selection 


Because we have only few claims data in this Swedish motorcycle example (only 
m = 656 insurance policies suffer claims), we do not perform a generalization 
analysis with learning and test samples. In this situation we need all data for 
model fitting, and model performance is analyzed with AIC and with tenfold cross- 
validation. 

The in-sample deviance loss in the gamma GLM is given by

$$
\mathfrak{D}(\mathcal{L}, \hat\mu(\cdot)) = \frac{2}{m} \sum_{i=1}^m \frac{n_i}{\varphi} \left( \frac{Y_i - \hat\mu(x_i)}{\hat\mu(x_i)} - \log\left(\frac{Y_i}{\hat\mu(x_i)}\right) \right), \tag{5.46}
$$

where $i$ runs over the policies $i = 1, \ldots, m$ with positive claims $Y_i = S_i/n_i > 0$,
and $\hat\mu(x_i) = \exp\langle \hat\beta^{MLE}, x_i \rangle$ is the MLE estimated regression function. Similar to
the Poisson case (5.29), McCullagh-Nelder [265] derive the following behavior
for the gamma unit deviance around its mode, see Section 7.2 and Figure 7.2 in
McCullagh-Nelder [265],

$$
d(Y_i, \mu_i) \approx 9\, Y_i^{2/3} \left( Y_i^{-1/3} - \mu_i^{-1/3} \right)^2; \tag{5.47}
$$

this uses that the log-likelihood is symmetric around its mode for scale $\mu^{-1/3}$, see
Fig. 5.5 (middle). This shows that the gamma deviance scales differently around $Y_i$
compared to the square loss function. From this we receive an approximation to the
deviance residuals (for $v/\varphi = 1$)

$$
r_i^{D} = {\rm sign}(Y_i - \mu_i) \sqrt{d(Y_i, \mu_i)} \approx 3 \left( \left(\frac{Y_i}{\mu_i}\right)^{1/3} - 1 \right) = 3\, \frac{Y_i^{1/3} - \mu_i^{1/3}}{\mu_i^{1/3}}. \tag{5.48}
$$

This is the cube-root transformation derived by Wilson-Hilferty [383]. This sug-
gests that if the empirical distribution of $Y_i^{1/3}$ looks roughly Gaussian we can use a
gamma distribution. Figure 5.10 gives the empirical densities of $Y_i$ on the left-hand
side and of $Y_i^{1/3}$ on the right-hand side. The latter looks roughly Gaussian (except
for the second mode close to 4); this supports the use of a gamma model.

Fig. 5.10 (lhs) Empirical density of $Y_i$ and (rhs) empirical density of $Y_i^{1/3}$
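
The deviance loss and the quality of the cube-root approximation are easy to check numerically. The sketch below uses the gamma unit deviance $d(y,\mu) = 2\left((y-\mu)/\mu - \log(y/\mu)\right)$; the function names and the toy values are illustrative assumptions, not code from the text.

```r
# gamma unit deviance
gamma_unit_deviance <- function(y, mu) 2 * ((y - mu) / mu - log(y / mu))

# weighted in-sample deviance loss in the spirit of (5.46); phi = 1 by default
gamma_deviance_loss <- function(y, mu, n, phi = 1) {
  2 / length(y) * sum(n / phi * ((y - mu) / mu - log(y / mu)))
}

# deviance residuals versus their cube-root approximation (5.48)
mu <- 1
y  <- c(0.5, 0.8, 1, 1.25, 2)
r_dev  <- sign(y - mu) * sqrt(gamma_unit_deviance(y, mu))
r_cube <- 3 * ((y / mu)^(1 / 3) - 1)
cbind(y, r_dev, r_cube)
```

For observations within a factor of two of the mean, the two residual columns agree to roughly two decimal places, which is why a roughly Gaussian shape of $Y_i^{1/3}$ is taken as support for the gamma assumption.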

Listing 5.11 provides the summary statistics of the fitted model Gamma GLM1;
note that we integrate the number of claims $n_i$ through scaling into the weights.
We have $q + 1 = 9$ regression parameters, and from these summary statistics we
observe that not all variables should be kept in the model. If we perform backward
elimination using drop1 in each step, see Sect. 5.3.3, we first drop BonusClass
and then Gender, resulting in a reduced model with 7 parameters. We call this
reduced model Gamma GLM2.


Listing 5.11 Results in model Gamma GLM1 using the R command glm

Call:
glm(formula = ClaimAmount/ClaimNb ~ OwnerAge + I(OwnerAge^2) +
    AreaGLM + RiskClass + VehAge + I(VehAge^2) + Gender + BonusClass,
    family = Gamma(link = "log"), data = mcdata0, weights = ClaimNb)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.3683  -1.4585  -0.5979   0.4354   3.4763

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     8.9737854  0.5532821  16.219  < 2e-16 ***
OwnerAge        0.1072781  0.0280862   3.820 0.000147 ***
I(OwnerAge^2)  -0.0014508  0.0003489  -4.158 3.65e-05 ***
AreaGLM        -0.0768512  0.0368284  -2.087 0.037303 *
RiskClass       0.0615575  0.0327553   1.879 0.060651 .
VehAge         -0.2051148  0.0296184  -6.925 1.05e-11 ***
I(VehAge^2)     0.0062649  0.0015946   3.929 9.45e-05 ***
GenderMale      0.1085538  0.1673443   0.649 0.516772
BonusClass      0.0089004  0.0225371   0.395 0.693029
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1.536577)

    Null deviance: 1368.0  on 655  degrees of freedom
Residual deviance: 1126.5  on 647  degrees of freedom
AIC: 14922

Number of Fisher Scoring iterations: 11


Table 5.13 Run times, number of parameters, AICs, Pearson's dispersion estimate, in-sample
losses, tenfold cross-validation losses and the in-sample average claim amounts of the null model
(gamma intercept model) and the gamma GLMs

Model       | Run time | # Param. | AIC    | Dispersion est. $\hat\varphi^P$ | In-sample loss on $\mathcal{L}$ | Tenfold CV loss $\hat{\mathfrak{D}}^{CV}$ | Average amount
Gamma null  | -        | 1+1      | 14'416 | 2.057 | 2.085 | 2.091 | 24'641
Gamma GLM1  | 1s       | 9+1      | 14'277 | 1.537 | 1.717 | 1.752 | 25'105
Gamma GLM2  | 1s       | 7+1      | 14'274 | 1.544 | 1.719 | 1.747 | 25'130


The results of models Gamma GLM1 and Gamma GLM2 are presented in 
Table 5.13. We show AICs, Pearson’s dispersion estimate, the in-sample deviance 
losses on all available data, the corresponding tenfold cross-validation losses, and 
the average claim amounts. 

Firstly, we observe that the GLMs do not meet the balance property. This is
implied by the fact that we do not use the canonical link to avoid any sort of difficulty
of dealing with the one-sided bounded effective domain $\Theta = (-\infty, 0)$. For pricing,
the intercept parameter $\hat\beta_0^{MLE}$ should be shifted to eliminate this bias, i.e., we need to
shift this parameter under the log-link by $-\log(25'130/24'641)$ for model Gamma
GLM2.
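
The size of this intercept correction can be computed directly from the two averages in Table 5.13. The snippet below is a small worked calculation; the variable names are illustrative.

```r
avg_observed <- 24641   # balanced average claim amount (Gamma null row of Table 5.13)
avg_fitted   <- 25130   # in-sample average of the fitted means of model Gamma GLM2

# additive shift of the intercept under the log-link
shift <- -log(avg_fitted / avg_observed)

# equivalent multiplicative correction applied to every fitted mean
factor <- exp(shift)
```

Under the log-link an additive shift of the intercept multiplies all fitted means by the same factor, so the corrected portfolio average equals the balanced value.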

Secondly, the in-sample and tenfold cross-validation losses are not directly
comparable to AIC. Observe that we need to know the dispersion parameter $\varphi$ in
order to calculate both of these statistics. For the in-sample and cross-validation
losses we have set $\varphi = 1$, thus, all these figures are directly comparable. For AIC
we have estimated the dispersion parameter $\varphi$ with MLE. This is the reason for
increasing the number of parameters in Table 5.13 by +1. Moreover, the resulting
AICs differ from the ones received from the R command glm, see, for instance,
Listing 5.11. The AIC value in Listing 5.11 does not consider all terms appropriately
due to the inclusion of weights (this is similar to Remark 5.22), it uses the
deviance dispersion estimate $\hat\varphi^D$, i.e., not the MLE, and it (still) increases the number
of parameters by 1 because the dispersion is estimated. For these reasons, we have
implemented our own code for calculating AIC. Both AIC and the tenfold cross-
validation losses say that we should give preference to model Gamma GLM2.
The dispersion estimate in Listing 5.11 corresponds to Pearson's estimate

$$
\hat\varphi^P = \frac{1}{m - (q+1)} \sum_{i=1}^m n_i\, \frac{(Y_i - \hat\mu_i)^2}{\hat\mu_i^2}. \tag{5.49}
$$
We observe that the dispersion estimate is roughly 1.5 which gives an estimate of
the shape parameter $\alpha = 1/\varphi$ of 2/3. A shape parameter less than 1 implies that the
density of the gamma distribution is strictly decreasing, see Fig. 2.1. Often this is a
sign that the model does not fully fit the data, and if we use this model for simulation
we may receive too many observations close to zero compared to the true data.
A shape parameter less than 1 may be implied by more heterogeneity in the data
compared to what the chosen gamma GLM allows for or by large claims that cannot
be explained by the present gamma density structure. Thus, there is some sign here
that the data is more heavy-tailed than our model choice suggests. Alternatively,
there might be some need to also model the shape parameter with a regression
model; this could be done using the vector-valued parameter EF representation of
the gamma model, see Sect. 2.1.3. In view of Fig. 5.10 (rhs) it may also be that
the feature information is not sufficient to describe the second mode in 4, thus, we
probably need more explanatory information to reduce dispersion.
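
Pearson's estimate is a one-line computation once the fitted means are available. The helper below is a sketch with hypothetical names and toy inputs; it also checks numerically that a gamma density with shape parameter below 1 is strictly decreasing.

```r
# Pearson's dispersion estimate for a gamma GLM with weights n_i
# y: average claims, mu: fitted means, n: claim counts, p = q + 1 parameters
pearson_dispersion <- function(y, mu, n, p) {
  sum(n * (y - mu)^2 / mu^2) / (length(y) - p)
}

# toy numbers, purely illustrative
phi_hat   <- pearson_dispersion(y = c(1, 2, 3), mu = c(1, 1, 2), n = c(1, 1, 1), p = 1)
alpha_hat <- 1 / phi_hat   # implied shape parameter

# gamma density with shape 2/3 (and mean 1) evaluated on a grid
y_grid <- seq(0.1, 3, by = 0.1)
dens <- dgamma(y_grid, shape = 2 / 3, rate = 2 / 3)
is_decreasing <- all(diff(dens) < 0)
```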

In Fig. 5.11 we give the Tukey-Anscombe plot and a QQ plot. Note that the
observations for $n_i = 1$ follow a gamma distribution with shape parameter $\alpha$
and scale parameter $c_i = \alpha/\mu_i = -\alpha\theta_i$. Thus, if we scale $Y_i/\mu_i$, we receive
i.i.d. gamma random variables with shape and scale parameters equal to $\alpha$. This
then allows us for $n_i = 1$ to plot the empirical distribution of $Y_i/\hat\mu_i$ against $\Gamma(\alpha, \alpha)$
in a QQ plot where we estimate $1/\alpha$ by Pearson's dispersion estimate. The Tukey-
Anscombe plot looks reasonable, but the QQ plot shows that the gamma model
does not entirely fit the data. From this plot we cannot conclude whether the gamma
distribution is causing the problem or whether it is a missing term in the regression
structure. We only see that the data is over-dispersed, resulting in more heavy-tailed
observations than the theoretical gamma model can explain, and a compensation
by too many small observations (which is induced by over-dispersion, i.e., a shape
parameter smaller than one). In the network chapter we will refine the regression
function, keeping the gamma assumption, to understand which modeling part is
causing the difficulty.
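
The construction of such a QQ plot can be sketched as follows. Since the Swedish motorcycle data and the fitted model are not reproduced here, the sketch uses simulated stand-ins; all variable names and numerical values are assumptions for illustration only.

```r
set.seed(1)
m      <- 500
alpha  <- 2 / 3                                   # assumed shape, e.g. 1 / Pearson's estimate
mu_hat <- exp(rnorm(m, mean = 10, sd = 0.3))      # stand-in for fitted means
y      <- rgamma(m, shape = alpha, rate = alpha / mu_hat)  # simulated claims with n_i = 1

qq <- data.frame(
  theoretical = qgamma(ppoints(m), shape = alpha, rate = alpha),  # Gamma(alpha, alpha) quantiles
  observed    = sort(y / mu_hat)                                  # empirical quantiles of Y_i / mu_i
)
# In an interactive session: plot(qq); abline(0, 1)
```

Points close to the diagonal indicate that the scaled observations are compatible with the $\Gamma(\alpha, \alpha)$ distribution; systematic deviations in the upper tail point to observations that are more heavy-tailed than the gamma model.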


Fig. 5.11 (lhs) Tukey-Anscombe plot of the fitted model Gamma GLM2, and (rhs) QQ plot of
the fitted model Gamma GLM2
[(lhs) x-axis: fitted means (log-scale), y-axis: deviance residuals; (rhs) x-axis: theoretical values, y-axis: observed values]


Remark 5.26 For the calculation of AIC in Table 5.13 we have used the MLE of the
dispersion parameter $\varphi$. This is obtained by solving the score equation (5.11) for the
gamma case. Setting $\alpha = 1/\varphi$ and calculating the MLE of $\alpha$ instead, it is given by

$$
\frac{\partial}{\partial \alpha} \ell_Y(\beta, \alpha) = \sum_{i=1}^n v_i \Big[ Y_i h(\mu(x_i)) - \kappa\big(h(\mu(x_i))\big) + \log Y_i + \log(\alpha v_i) + 1 - \psi(\alpha v_i) \Big] = 0,
$$

where $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$ is the digamma function. We calculate the second
derivative w.r.t. $\alpha$, see also (2.30),

$$
\frac{\partial^2}{\partial \alpha^2} \ell_Y(\beta, \alpha) = \sum_{i=1}^n v_i \left[ \frac{1}{\alpha} - v_i \psi'(\alpha v_i) \right] = \sum_{i=1}^n v_i^2 \left[ \frac{1}{\alpha v_i} - \psi'(\alpha v_i) \right] < 0 \qquad \text{for } \alpha > 0;
$$

the negativity follows from Theorem 1 in Alzner [9]. In fact, the function $\log \alpha -
\psi(\alpha)$ is strictly completely monotonic for $\alpha > 0$. This says that the log-likelihood
$\ell_Y(\beta, \alpha)$ is a concave function in $\alpha > 0$ and the solution to the score equation is
unique, giving the MLE of $\alpha$ and $\varphi$, respectively.
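
Because the score in $\alpha$ is strictly decreasing, a one-dimensional root search suffices. The following sketch solves the score equation with the base R function uniroot on simulated gamma data; the simulated means stand in for fitted values of a GLM, and all names and parameter values are illustrative assumptions. It uses $h(\mu) = -1/\mu$ and $\kappa(h(\mu)) = \log \mu$ for the gamma model.

```r
set.seed(42)
m          <- 2000
alpha_true <- 2
v  <- sample(1:3, m, replace = TRUE)              # weights v_i = n_i
mu <- exp(rnorm(m, mean = 8, sd = 0.5))           # stand-in for the fitted means
y  <- rgamma(m, shape = alpha_true * v, rate = alpha_true * v / mu)

# score in alpha for given means
score_alpha <- function(alpha) {
  sum(v * (-y / mu - log(mu) + log(y) + log(alpha * v) + 1 - digamma(alpha * v)))
}

# the score is positive near 0 and negative for large alpha, so the root is bracketed
alpha_hat <- uniroot(score_alpha, interval = c(1e-3, 1e3))$root
phi_hat   <- 1 / alpha_hat
```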


5.3.8 Lab: Inverse Gaussian GLM for Claim Sizes 


We present the inverse Gaussian GLM in this section as a competing model to the 
gamma GLM studied in the previous section. 


Infinite Divisibility 


In the gamma model above we have used that the total claim amount $S = \sum_{j=1}^n Z_j$
has a gamma distribution for given claim counts $N = n > 0$ and i.i.d. gamma
claim sizes $Z_j$. This property is closely related to divisibility. A random variable $S$
is called divisible by $n \in \mathbb{N}$ if there exist i.i.d. random variables $Z_1, \ldots, Z_n$ such
that

$$
S \overset{(d)}{=} \sum_{j=1}^n Z_j,
$$

and $S$ is called infinitely divisible if $S$ is divisible by $n$ for all $n \in \mathbb{N}$. The EDF
is based on parameters $(\theta, \omega) \in \Theta \times \mathcal{W}$. Jørgensen [203] gives the following
interesting result.


Theorem 5.27 (Theorem 3.7 in Jørgensen [203], Without Proof) Choose a 
member of the EDF with parameter set © x W. Then 


e the index set W is an additive semi-group and N C W C R4, and 
¢ the members of the chosen EDF are infinitely divisible if and only if W = R+. 


This theorem tells us how to aggregate and disaggregate within EDFs, e.g., 
the Poisson, gamma and inverse Gaussian models are infinitely divisible, and the 
binomial distribution is divisible by n with the disaggregated random variables 
belonging to the same EDF and the same canonical parameter, see Sect. 2.2.2. In 
particular, we also refer to Corollary 2.15 on the convolution property. 


Inverse Gaussian Generalized Linear Model 


Alternatively to the gamma GLM one often explores an inverse Gaussian GLM
which has a cubic variance function $V(\mu) = \mu^3$. We bring this inverse Gaussian
model into the same form as the gamma model of Sect. 5.3.7, so that we can
aggregate claims within insurance policies. The mean, the variance and the moment
generating function of an inverse Gaussian random variable $Z_{i,j}$ with parameters
$\alpha_i, c_i > 0$ are given by

$$
\mathbb{E}[Z_{i,j}] = \frac{\alpha_i}{c_i}, \qquad \mathrm{Var}(Z_{i,j}) = \frac{\alpha_i}{c_i^3} \qquad \text{and} \qquad M_{Z_{i,j}}(r) = \exp\Big\{ \alpha_i \Big( c_i - \sqrt{c_i^2 - 2r} \Big) \Big\},
$$

where the moment generating function requires $r < c_i^2/2$ to be finite. From the
moment generating function we see that $S_i = \sum_{j=1}^{n_i} Z_{i,j}$ is inverse Gaussian
distributed with parameters $n_i \alpha_i$ and $c_i$. Finally, we scale $Y_i = S_i/(n_i \alpha_i)$ which
provides us with an inverse Gaussian distribution with parameters $n_i^{1/2} \alpha_i^{1/2}$ and
$n_i^{1/2} \alpha_i^{1/2} c_i$. This random variable $Y_i$ has a single-parameter EDF inverse Gaussian
distribution in its reproductive form, namely,

$$
Y_i \sim f(y; \theta_i, v_i/\varphi_i) = \exp\Big\{ \frac{y \theta_i - \kappa(\theta_i)}{\varphi_i / v_i} + a(y; v_i/\varphi_i) \Big\}
= \frac{\sqrt{\alpha_i}}{\sqrt{\frac{2\pi}{v_i} y^3}} \exp\Big\{ - \frac{\alpha_i}{2 y / v_i} \Big( 1 - \sqrt{-2\theta_i}\, y \Big)^2 \Big\}, \tag{5.50}
$$

with cumulant function $\kappa(\theta) = -\sqrt{-2\theta}$ for $\theta \in \Theta = (-\infty, 0]$, weight $v_i = n_i$,
dispersion parameter $\varphi_i = 1/\alpha_i$ and canonical parameter $\theta_i = -c_i^2/2$.
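The aggregation property used above can be checked by simulation. The sketch below is only an illustration under stated assumptions: it relies on NumPy's Wald sampler, whose (mean, scale) parametrization corresponds to mean $\alpha/c$ and scale $\alpha^2$ in the $(\alpha, c)$ parametrization of this section (so that the variance is $\alpha/c^3$), and the chosen parameter values are arbitrary.

```python
import numpy as np

def simulate_aggregate(alpha, c, n, size, rng):
    """Simulate S = Z_1 + ... + Z_n for i.i.d. inverse Gaussian Z_j with mean alpha/c
    and variance alpha/c^3 (NumPy's Wald law with mean=alpha/c, scale=alpha^2)."""
    z = rng.wald(mean=alpha / c, scale=alpha**2, size=(size, n))
    return z.sum(axis=1)

rng = np.random.default_rng(1)
alpha, c, n = 2.0, 0.5, 4          # illustrative values
s = simulate_aggregate(alpha, c, n, size=200_000, rng=rng)

# Moments of an inverse Gaussian law with parameters (n * alpha, c)
mean_theory = n * alpha / c
var_theory = n * alpha / c**3
print(s.mean(), mean_theory)
print(s.var(), var_theory)
```

Matching the first two moments does not prove the distributional statement, which follows from the moment generating function; the simulation only makes the parameters $n_i \alpha_i$ and $c_i$ of the aggregate tangible.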

Similarly to the gamma case, this representation is not directly useful if the
parameter $\alpha_i$ is not known. Therefore, we parametrize this model differently.
Namely, we consider

$$
Y_i = S_i/n_i \sim \mathrm{InvGauss}\Big( n_i^{1/2} \alpha_i, \; n_i^{1/2} c_i \Big). \tag{5.51}
$$

This re-scaled random variable has that same inverse Gaussian EDF (5.50), but
we need to re-interpret the parameters. We have dispersion parameter $\varphi_i = 1/\alpha_i^2$
and canonical parameter $\theta_i = -c_i^2/(2\alpha_i^2)$. For our GLM analysis we will treat
the parameter $\alpha_i \equiv \alpha > 0$ as a nuisance parameter that does not depend on the
specific policy $i$. Thus, we have constant dispersion $\varphi = 1/\alpha^2$ and only the scale
parameter $c_i$ is assumed to be policy dependent through the canonical parameter
$\theta_i = -c_i^2/(2\alpha^2)$.
We are now in the same situation as in the gamma case in Sect. 5.3.7. We choose
the log-link for $g$ which implies

$$
\mu_i = \mathbb{E}_{\theta_i}[Y_i] = \kappa'(\theta_i) = \frac{1}{\sqrt{-2\theta_i}} = \exp\{\eta_i\} = \exp\langle \beta, x_i \rangle,
$$

for $x_i \in \mathcal{X} \subset \mathbb{R}^{q+1}$ describing the pre-processed features of policy $i$. We use the
same feature pre-processing as in model Gamma GLM2, and we call this resulting
model IG GLM2. Again the constant dispersion parameter $\varphi = 1/\alpha^2$ cancels in the
score equations, thus, we do not need to explicitly specify the nuisance parameter
$\alpha$ to estimate the regression parameter $\beta \in \mathbb{R}^{q+1}$. However, there is an important
difference to the gamma GLM, namely, as stated in Example 5.6, we do not have a
concave maximization problem and Fisher's scoring method needs a suitable initial
value. We start the fitting algorithm in the parameters of model Gamma GLM2.
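The in-sample deviance loss in the inverse Gaussian GLM is given by

$$
\mathfrak{D}(\mathcal{L}, \widehat{\mu}(\cdot)) = \frac{1}{m} \sum_{i=1}^m \frac{n_i}{\varphi}\, \frac{\big(Y_i - \widehat{\mu}(x_i)\big)^2}{\widehat{\mu}(x_i)^2\, Y_i}, \tag{5.52}
$$

where $i$ runs over the policies $i = 1, \ldots, m$ with positive claims $Y_i = S_i/n_i > 0$,
and $\widehat{\mu}(x_i) = \exp\langle \widehat{\beta}^{\rm MLE}, x_i \rangle$ is the MLE estimated regression function. The unit
deviances behave as

$$
\mathfrak{d}(Y_i, \mu_i) = Y_i \big( Y_i^{-1} - \mu_i^{-1} \big)^2, \tag{5.53}
$$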




Table 5.14 Run times, number of parameters, AICs, in-sample losses, tenfold cross-validation
losses and the in-sample average claim amounts of the null gamma model, model Gamma GLM2,
the null inverse Gaussian model, and model inverse Gaussian GLM2; the deviance losses use unit
dispersion $\varphi = 1$

             | Run time | # Param. | AIC    | In-sample loss on L | Tenfold CV loss D^CV | Average amount
Gamma null   | –        | 1+1      | 14'416 | 2.085               | 2.091                | 24'641
Gamma GLM2   | 1s       | 7+1      | 14'274 | 1.719               | 1.747                | 25'130
IG null      | –        | 1+1      | 14'715 | 5.012·10^{-4}       | 5.016·10^{-4}        | 24'641
IG GLM2      | 1s       | 7+1      | 14'686 | 4.793·10^{-4}       | 4.820·10^{-4}        | 32'268


note that the log-likelihood is symmetric around its mode for scale $\mu_i^{-1}$, see Fig. 5.5
(rhs). From this we receive deviance residuals (for $v/\varphi = 1$)

$$
r_i^{\rm D} = \mathrm{sign}(Y_i - \mu_i) \sqrt{\mathfrak{d}(Y_i, \mu_i)} = Y_i^{1/2} \big( \mu_i^{-1} - Y_i^{-1} \big).
$$

Thus, these residuals behave as $Y_i^{1/2}$ for $Y_i \to \infty$ (and fixed $\mu_i^{-1}$), which is
more heavy-tailed than the cube-root behavior $Y_i^{1/3}$ in the gamma case, see (5.48).
Another difference to the gamma case is that the deviance loss (5.52) is not scale-
invariant, see also (11.4), below.
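The lack of scale-invariance is easy to see numerically. The following sketch evaluates the inverse Gaussian unit deviance (5.53) and the standard gamma unit deviance $2(y/\mu - 1 - \log(y/\mu))$ before and after multiplying observations and means by a common factor (e.g., a change of currency unit). The function names and the numerical values are illustrative assumptions.

```python
import numpy as np

def ig_unit_deviance(y, mu):
    """Inverse Gaussian unit deviance y * (1/y - 1/mu)^2 = (y - mu)^2 / (mu^2 * y)."""
    return (y - mu) ** 2 / (mu**2 * y)

def gamma_unit_deviance(y, mu):
    """Gamma unit deviance 2 * (y/mu - 1 - log(y/mu))."""
    return 2.0 * (y / mu - 1.0 - np.log(y / mu))

def ig_deviance_residual(y, mu):
    """Deviance residual sign(y - mu) * sqrt(unit deviance), for v/phi = 1."""
    return np.sqrt(y) * (1.0 / mu - 1.0 / y)

y = np.array([500.0, 2_000.0, 30_000.0])
mu = np.array([1_000.0, 1_000.0, 1_000.0])
scale = 100.0                                   # common re-scaling of y and mu

gamma_ratio = gamma_unit_deviance(scale * y, scale * mu) / gamma_unit_deviance(y, mu)
ig_ratio = ig_unit_deviance(scale * y, scale * mu) / ig_unit_deviance(y, mu)

print(ig_deviance_residual(y, mu))
print(gamma_ratio)   # identical to 1: the gamma deviance ignores the unit
print(ig_ratio)      # equal to 1/scale: the inverse Gaussian deviance depends on the unit
```

This is also why the inverse Gaussian losses in Table 5.14 live on a completely different numerical scale than the gamma losses and cannot be compared to them directly.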

We revisit the example of Table 5.13, but we replace the gamma distribution 
by the inverse Gaussian distribution. The results in Table 5.14 show that the inverse 
Gaussian model is not fully competitive on this data set. In view of (5.43) we observe 
that the coefficient of variation (standard deviation divided by mean) is in the gamma 
model given by $1/\sqrt{\alpha}$, thus, in the gamma model this coefficient of variation is
independent of the expected claim size $\mu_i$ and only depends on the shape parameter
$\alpha$. In the inverse Gaussian model the coefficient of variation is given by

$$
\mathrm{Vco}(Z_{i,j}) = \frac{\sqrt{\mathrm{Var}(Z_{i,j})}}{\mathbb{E}[Z_{i,j}]} = \frac{1}{\sqrt{\alpha c_i}} = \frac{\sqrt{\mu_i}}{\alpha},
$$

thus, it monotonically increases in the expected claim size $\mu_i$. It seems that this
structure is not fully suitable for this data set, i.e., there is no indication that the 
coefficient of variation increases in the expected claim size. We come back to a 
comparison of the gamma and the inverse Gaussian model in Sect. 11.1, below. 


5.3.9 Log-Normal Model for Claim Sizes: A Short Discussion 


Another way to improve the gamma model of Sect. 5.3.7 could be to use a log- 
normal distribution instead. In the above situation this does not work because the 
observations are not in the right format. If the claim observations $Z_{i,j}$ are
log-normally distributed, then $\log(Z_{i,j})$ are normally distributed. Unfortunately, in our
Swedish motorcycle data set we do not have individual claim observations $Z_{i,j}$,
but the provided information is aggregated over all claims per insurance policy, i.e.,
$S_i = \sum_{j=1}^{n_i} Z_{i,j}$. Therefore, there is no possibility here to challenge the gamma
framework of Sect.5.3.7 with a corresponding log-normal framework, because 
the log-normal framework is not closed under summation of i.i.d. log-normally 
distributed random variables. 

We would like to give some remarks that concern calculations on the log-scale (or 
any other strictly increasing and concave transformation of the original data). For the 
log-normal distribution, as well as in similar cases like the log-gamma distribution, 
one works with logged observations $Y_i = \log(Z_i)$. This is a strictly monotone
transformation and the MLEs in the log-normal model based on observations $Z_i$
and in the normal model based on observations $Y_i = \log(Z_i)$ coincide. This can be
seen from the following calculation. We start from the log-normal density on $\mathbb{R}_+$,
and we do a transformation of variable $z > 0 \mapsto y = \log(z) \in \mathbb{R}$ with $dy = dz/z$

$$
f_{\rm LN}(z; \mu, \sigma^2)\, dz = \frac{1}{\sqrt{2\pi\sigma^2}\, z} \exp\Big\{ -\frac{1}{2\sigma^2} \big( \log(z) - \mu \big)^2 \Big\}\, dz
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (y - \mu)^2 \Big\}\, dy = f_{\mathcal{N}}(y; \mu, \sigma^2)\, dy.
$$

From this we see that the MLEs will coincide.

In many situations, one assumes that $\sigma^2 > 0$ is a given nuisance parameter,
and one models $x \mapsto \mu(x)$ with a GLM within the single-parameter EDF. In the
log-normal/Gaussian case one typically chooses the canonical link on the log-scale
which is the identity function. This then allows one to perform a classical linear
regression for $\mu(x) = \langle \beta, x \rangle$ using the logged observations $Y = (Y_1, \ldots, Y_n)^\top =
(\log(Z_1), \ldots, \log(Z_n))^\top$, and the corresponding MLE is given by

$$
\widehat{\beta}^{\rm MLE} = \big( \mathfrak{X}^\top \mathfrak{X} \big)^{-1} \mathfrak{X}^\top Y, \tag{5.54}
$$

for full rank $q + 1 \le n$ design matrix $\mathfrak{X}$. Note that in this case we have a closed-
form solution for the MLE of $\beta$. This is called the homoskedastic case because
all observations $Y_i$ are assumed to have the same variance $\sigma^2$, otherwise, in the
heteroskedastic case, we would still have to include the covariance matrix.
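The closed-form solution (5.54) is a plain least-squares fit on the logged observations. A minimal sketch on simulated log-normal claim sizes follows; the design matrix, the true parameter and the sample size are assumptions chosen for illustration, and the normal equations are solved directly instead of forming the matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])   # design matrix with intercept
beta_true = np.array([7.0, 0.3, -0.2])                        # illustrative regression parameter
sigma = 0.8                                                   # homoskedastic standard deviation

Z = np.exp(X @ beta_true + sigma * rng.normal(size=n))        # log-normal claim sizes
Y = np.log(Z)                                                 # logged observations

# MLE (5.54): solve the normal equations X^T X beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
print(beta_hat, beta_lstsq)
```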

Since we work with the canonical link on the log-scale we have the balance 
property on the log-scale, see Corollary 5.7. Thus, we receive unbiasedness 


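$$
\sum_{i=1}^n \mathbb{E}_\beta\big[ \widehat{\mu}(x_i) \big] = \sum_{i=1}^n \mathbb{E}_\beta\big[ \langle \widehat{\beta}^{\rm MLE}, x_i \rangle \big] = \sum_{i=1}^n \mu(x_i) = \sum_{i=1}^n \mathbb{E}_\beta[Y_i]. \tag{5.55}
$$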


Fig. 5.12 (lhs) Tukey–Anscombe plot of the fitted Gaussian model $\widehat{\mu}(x_i)$ on the logged claim
sizes $Y_i = \log(Z_i)$, and (rhs) estimated means $\widehat{\mu}_{Z_i}$ as a function of $\widehat{\mu}(x_i)$ considering
heteroskedasticity $\widehat{\sigma}(x_i)$ (panel titles: "Tukey-Anscombe plot" and "variance correction in
log-normal models"; x-axis of both panels: fitted log-means E[Y]=E[log Z])


If we move back to the original scale of the observations $Z_i$ we receive from the
log-normal assumption

$$
\mathbb{E}_{(\widehat{\beta}^{\rm MLE}, \sigma^2)}[Z_i] = \exp\Big\{ \langle \widehat{\beta}^{\rm MLE}, x_i \rangle + \sigma^2/2 \Big\}.
$$

Therefore, we need to adjust with the nuisance parameter $\sigma^2$ for the
back-transformation to the original observation scale. At this point, typically, the
difficulties start. Often, a good back-transformation involves a feature dependent
variance parameter $\sigma^2(x_i)$, thus, in many practical applications the
homoskedasticity assumption is not fulfilled, and a constant variance parameter choice leads to
a poor model on the original observation scale.

A suitable estimation of $\sigma^2(x_i)$ may turn out to be rather difficult. This is
illustrated in Fig. 5.12. The left-hand side of this figure shows the Tukey–Anscombe
plot of the homoskedastic case providing unscaled ($\sigma^2 = 1$) (Pearson's) residuals
on the log-scale

$$
r_i^{\rm P} = \log(Z_i) - \widehat{\mu}(x_i) = Y_i - \widehat{\mu}(x_i).
$$


The light-blue color shows an insurance policy dependent standard deviation
estimate $\widehat{\sigma}(x_i)$. In our case this estimate is non-monotone in $\widehat{\mu}(x_i)$ (which is quite
common on real data). Using this estimate we can estimate the means of the log-
normal random variables by

$$
\widehat{\mu}_{Z_i} = \widehat{\mathbb{E}}[Z_i] = \exp\Big\{ \widehat{\mu}(x_i) + \widehat{\sigma}(x_i)^2/2 \Big\}.
$$

The right-hand side of Fig. 5.12 plots these estimated means $\widehat{\mu}_{Z_i}$ against the
estimated means $\widehat{\mu}(x_i)$ on the log-scale. We observe a graph that is non-monotone,
implied by the non-monotonicity of the standard deviation estimate $\widehat{\sigma}(x_i)$ as a
function of $\widehat{\mu}(x_i)$. This non-monotonicity is not bad per se, as we still have a
proper statistical model, however, it might be rather counter-intuitive and difficult to
explain. For this reason it is advisable to directly model the expected value by one
single function, and not to decompose it into different regression functions.

Another important point to be considered is that for model selection using AIC 
we have to work on the same scale for all models. Thus, if we use a gamma model to 
model $Z_i$, then for an AIC selection we need to evaluate also the log-normal model
on that scale. This can be seen from the justification in Sect. 4.2.3. 

Finally, we focus on unbiasedness. Note that on the log-scale we have unbiased- 
ness (5.55) through the balance property. Unfortunately, this does not carry over to 
the original scale. We give a small example, where we assume that there is neither 
any uncertainty about the distributional model nor about the nuisance parameter. 
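That is, we assume that $Z_i$ are i.i.d. log-normally distributed with parameters $\mu$ and
$\sigma^2$, where only $\mu$ is unknown. The MLE of $\mu$ is given by

$$
\widehat{\mu}^{\rm MLE} = \frac{1}{n} \sum_{i=1}^n \log(Z_i) \sim \mathcal{N}(\mu, \sigma^2/n).
$$

In this case we have

$$
\frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(\mu, \sigma^2)}\Big[ \mathbb{E}_{(\widehat{\mu}^{\rm MLE}, \sigma^2)}[Z_i] \Big] = \mathbb{E}_{(\mu, \sigma^2)}\Big[ \exp\{\widehat{\mu}^{\rm MLE}\} \Big] \exp\{\sigma^2/2\}
= \exp\Big\{ \mu + (1 + n^{-1}) \sigma^2/2 \Big\}
> \exp\big\{ \mu + \sigma^2/2 \big\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(\mu, \sigma^2)}[Z_i].
$$

Volatility in parameter estimation $\widehat{\mu}^{\rm MLE}$ leads to a positive bias in this case. Note
that we have assumed full knowledge of the distributional model (i.i.d. log-normal)
and the nuisance parameter $\sigma^2$ in this calculation. If, for instance, we do not know
the true nuisance parameter and we work with (deterministic) $\widetilde{\sigma}^2 < \sigma^2$ and $n > 1$,
we can get a negative bias

$$
\frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(\mu, \sigma^2)}\Big[ \mathbb{E}_{(\widehat{\mu}^{\rm MLE}, \widetilde{\sigma}^2)}[Z_i] \Big] = \mathbb{E}_{(\mu, \sigma^2)}\Big[ \exp\{\widehat{\mu}^{\rm MLE}\} \Big] \exp\{\widetilde{\sigma}^2/2\}
= \exp\Big\{ \mu + \sigma^2/(2n) + \widetilde{\sigma}^2/2 \Big\}
< \exp\big\{ \mu + \sigma^2/2 \big\} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(\mu, \sigma^2)}[Z_i],
$$

where the inequality holds as soon as $\widetilde{\sigma}^2 < (1 - n^{-1}) \sigma^2$.

This shows that working on the log-scale is rather difficult because the
back-transformation is far from being trivial, and for unknown nuisance parameter not
even the sign of the bias is clear. Similar considerations apply to the frequently used
Box–Cox transformation [48] for $\chi \neq 1$

$$
Z_i \mapsto Y_i = \begin{cases} \dfrac{Z_i^\chi - 1}{\chi} & \text{for } \chi \neq 0, \\[1ex] \log(Z_i) & \text{for } \chi = 0. \end{cases}
$$

For this reason, if unbiasedness is a central requirement (like in insurance pricing)
non-linear transformations should only be used with great care (and only if
necessary).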
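The sign of the back-transformation bias in the log-normal example above can be checked numerically. The sketch below evaluates the two closed-form expressions and confirms the first one by a small Monte Carlo experiment; the parameter values, the helper name and the choice $\widetilde{\sigma}^2 = 0.8\, \sigma^2$ are assumptions made for illustration.

```python
import numpy as np

def mean_backtransformed_estimate(mu, sigma2, n, sigma2_used):
    """Expected value of exp(mu_hat + sigma2_used / 2) for mu_hat ~ N(mu, sigma2 / n)."""
    return np.exp(mu + sigma2 / (2 * n) + sigma2_used / 2)

mu, sigma2, n = 8.0, 1.0, 20                       # illustrative values
true_mean = np.exp(mu + sigma2 / 2)                # E[Z_i] on the original scale

known = mean_backtransformed_estimate(mu, sigma2, n, sigma2_used=sigma2)
too_small = mean_backtransformed_estimate(mu, sigma2, n, sigma2_used=0.8 * sigma2)

rel_bias_known = known / true_mean - 1.0           # positive: true nuisance parameter used
rel_bias_small = too_small / true_mean - 1.0       # negative: variance parameter chosen too small
print(rel_bias_known, rel_bias_small)

# Monte Carlo check of the first case
rng = np.random.default_rng(3)
log_z = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))
mc = np.exp(log_z.mean(axis=1) + sigma2 / 2).mean()
print(mc / known)                                  # should be close to 1
```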


5.4 Quasi-Likelihoods 


Above we have been mentioning the notion of over-dispersed Poisson models. 
This naturally leads to so-called quasi-Poisson models and quasi-likelihoods. The 
framework of quasi-likelihoods has been introduced by Wedderburn [376]. In this 
section we give the main idea behind quasi-likelihoods, and for a more detailed 
treatment and mathematical results we refer to Chapter 8 of McCullagh—Nelder 
[265]. 

In Sect. 5.1.4 we have discussed the estimation of GLMs. This has been based 
on the explicit knowledge of the full log-likelihood function $\ell_Y(\beta)$ for given data
$Y$. This has allowed us to calculate the score equations $s(\beta, Y) = \nabla_\beta \ell_Y(\beta) = 0$
whose solutions (Z-estimators) contain the MLE for $\beta$. The solutions of the score
equations themselves, using Fisher's scoring method, no longer need the explicit
functional form of the log-likelihood, but they are only based on the first and
second moments, see (5.9) and Remarks 5.4. Thus, all models where these first
two moments coincide will provide the same MLE for the regression parameter
$\beta$; this is also the explanation behind the IRLS algorithm. Moreover, the first two
moments are sufficient for prediction and uncertainty quantification based on mean 
squared errors, and they are also sufficient to quantify asymptotic normality. This is 
exactly what motivates the quasi-likelihood considerations, and these considerations 
are also related to the quasi-generalized pseudo maximum likelihood estimator 
(QPMLE) that we are going to discuss in Theorem 11.8, below. 

Assume that $Y$ is a random vector having first moment $\mu \in \mathbb{R}^n$, positive
definite variance function $V(\mu) \in \mathbb{R}^{n \times n}$ and dispersion parameter $\varphi$. The
quasi-(log-)likelihood function $\ell_Y(\mu)$ assumes that its gradient is given by

$$
\nabla_\mu \ell_Y(\mu) = \frac{1}{\varphi}\, V(\mu)^{-1} (Y - \mu).
$$

In case of a diagonal variance function $V(\mu)$ this relates to the score (5.9). The
remaining step is to model the mean parameter $\mu = \mu(\beta) \in \mathbb{R}^n$ as a function of a
lower dimensional regression parameter $\beta \in \mathbb{R}^{q+1}$, we also refer to Fig. 5.2. For
this last step we assume that the Jacobian $B \in \mathbb{R}^{n \times (q+1)}$ of $d\mu/d\beta$ has full rank
$q + 1$. The score equations for $\beta$ and given observations $Y$ then read as

$$
\frac{1}{\varphi}\, B^\top V(\mu(\beta))^{-1} \big( Y - \mu(\beta) \big) = 0.
$$

This is of exactly the same structure as the score equations in Proposition 5.1, and
the roots are found by using the IRLS algorithm for $t \ge 0$, see (5.12),

$$
\widehat{\beta}^{(t)} \mapsto \widehat{\beta}^{(t+1)} = \Big( B^\top V(\widehat{\mu}^{(t)})^{-1} B \Big)^{-1} B^\top V(\widehat{\mu}^{(t)})^{-1} \Big( B \widehat{\beta}^{(t)} + Y - \widehat{\mu}^{(t)} \Big),
$$

where $\widehat{\mu}^{(t)} = \mu(\widehat{\beta}^{(t)})$.
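The IRLS recursion above only needs the mean function, its Jacobian and the variance function. The following is a minimal sketch for a diagonal variance function, applied to a quasi-Poisson example with log-link, $\mu(\beta) = \exp(\mathfrak{X}\beta)$ and $V(\mu_i) = \mu_i$. The function `quasi_irls`, the simulated over-dispersed counts and the starting value are assumptions for illustration; the dispersion is estimated afterwards by Pearson's estimate.

```python
import numpy as np

def quasi_irls(Y, mean_fn, jac_fn, var_fn, beta0, n_iter=100, tol=1e-10):
    """IRLS for the quasi-score equations B^T V(mu)^{-1} (Y - mu) = 0 with diagonal V.
    mean_fn(beta) returns mu(beta), jac_fn(beta) the Jacobian B = d mu / d beta,
    var_fn(mu) the diagonal entries V(mu_i)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        mu = mean_fn(beta)
        B = jac_fn(beta)
        w = 1.0 / var_fn(mu)                           # diagonal of V(mu)^{-1}
        lhs = B.T @ (w[:, None] * B)
        rhs = B.T @ (w * (B @ beta + Y - mu))
        beta_new = np.linalg.solve(lhs, rhs)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Over-dispersed counts: Y = phi * Poisson(mu / phi) has mean mu and variance phi * mu
rng = np.random.default_rng(2)
n, phi_true = 2000, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
Y = phi_true * rng.poisson(np.exp(X @ beta_true) / phi_true)

beta_hat = quasi_irls(
    Y,
    mean_fn=lambda b: np.exp(X @ b),
    jac_fn=lambda b: np.exp(X @ b)[:, None] * X,
    var_fn=lambda mu: mu,
    beta0=np.array([np.log(Y.mean()), 0.0]),
)
mu_hat = np.exp(X @ beta_hat)
score = X.T @ (Y - mu_hat)                              # quasi-score (up to 1/phi) at the root
phi_hat = np.sum((Y - mu_hat) ** 2 / mu_hat) / (n - X.shape[1])   # Pearson's estimate
print(beta_hat, phi_hat)
```

Note that the dispersion never enters the recursion; it is only needed afterwards for uncertainty quantification.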


We conclude with the following points about quasi-likelihoods: 


• For regression parameter estimation within the quasi-likelihood framework it
is sufficient to know the structure of the first two moments $\mu(\beta) \in \mathbb{R}^n$ and
$V(\mu) \in \mathbb{R}^{n \times n}$ as well as the score equations. Thus, we do not need to explicitly
specify a distributional family for the observations $Y$. This structure of the first
two moments is then sufficient for their estimation using the IRLS algorithm, i.e.,
we receive the predictors within this framework.

• Since we do not specify the full distribution of $Y$ we can neither simulate from
this model nor can we calculate quantities where the full log-likelihood of the
model needs to be known. For example, we cannot calculate AIC in a quasi-
likelihood model.

• The quasi-likelihood model is characterized by the functional forms of $\mu(\beta)$ and
$V(\mu)$. The former plays the role of the link function and the linear predictor in the
GLM, and the latter plays the role of the variance function within the EDF which
is characterized through the cumulant function $\kappa$. For instance, if we assume to
have a diagonal matrix

$$
V(\mu) = \mathrm{diag}\big( V(\mu_1), \ldots, V(\mu_n) \big),
$$

then, the choice of the variance function $\mu \mapsto V(\mu)$ describes the explicit
selection of the quasi-likelihood model. If we choose the power variance function
$V(\mu) = \mu^p$, $p \notin (0, 1)$, we have a quasi-Tweedie's model.

• For prediction uncertainty evaluation we also need an estimate of the dispersion
parameter $\varphi > 0$. Since we do not know the full likelihood in this approach,
Pearson's estimate $\widehat{\varphi}^{\rm P}$ is the only option we have to estimate $\varphi$.

• For asymptotic normality results and hypothesis testing within the quasi-
likelihood framework we refer to Section 8.4 of McCullagh–Nelder [265].


5.5 Double Generalized Linear Model


In the derivations above we have treated the dispersion parameter $\varphi$ in the GLM as
a nuisance parameter. In the case of a homogeneous dispersion parameter it can be
canceled in the score equations for MLE, see (5.9). Therefore, it does not influence
MLE, and in a subsequent step this nuisance parameter can still be estimated
using, e.g., Pearson's or deviance residuals, see Sect. 5.3.1 and Remark 5.26. In
some examples we may have systematic effects in the dispersion parameter, too.
In this case the above approach will not work because a heterogeneous dispersion
parameter no longer cancels in the score equations. This has been considered in
Smyth [341] and Smyth–Verbyla [343]. The heterogeneous dispersion situation is
of general interest for GLMs, and it is of particular interest for Tweedie's CP GLM
if we interpret Tweedie's distribution [358] as a CP model with i.i.d. gamma claim
sizes, see Proposition 2.17; we also refer to Jørgensen–de Souza [204],
Smyth–Jørgensen [342] and Delong et al. [94].


5.5.1 The Dispersion Submodel 


We extend model assumption (5.1) by assuming that also the dispersion parameter
$\varphi_i$ is policy $i$ dependent. Assume that all random variables $Y_i$ are independent and
have densities w.r.t. a $\sigma$-finite measure $\nu$ on $\mathbb{R}$ given by

$$
Y_i \sim f(y_i; \theta_i, v_i/\varphi_i) = \exp\Big\{ \frac{y_i \theta_i - \kappa(\theta_i)}{\varphi_i / v_i} + a(y_i; v_i/\varphi_i) \Big\},
$$

for $1 \le i \le n$, with canonical parameters $\theta_i \in \mathring{\Theta}$, exposures $v_i > 0$ and dispersion
parameters $\varphi_i > 0$. As in (5.5) we assume that every policy $i$ is equipped with
feature information $x_i \in \mathcal{X}$ such that for a given link function $g : \mathcal{M} \to \mathbb{R}$ we can
model its mean as

$$
x_i \mapsto g(\mu_i) = g(\mu(x_i)) = g\big( \mathbb{E}_{\theta_i}[Y_i] \big) = \eta_i = \eta(x_i) = \langle \beta, x_i \rangle. \tag{5.56}
$$

This provides us with log-likelihood function for observation $Y = (Y_1, \ldots, Y_n)^\top$

$$
\beta \mapsto \ell_Y(\beta) = \sum_{i=1}^n \frac{v_i}{\varphi_i} \Big[ Y_i h(\mu(x_i)) - \kappa\big( h(\mu(x_i)) \big) \Big] + a(Y_i; v_i/\varphi_i),
$$

with canonical link $h = (\kappa')^{-1}$. The difference to (5.7) is that the dispersion
parameter $\varphi_i$ now depends on the insurance policy which requires additional
modeling. We choose a second strictly monotone and smooth link function
$g_\varphi : \mathbb{R}_+ \to \mathbb{R}$, and we express the dispersion of policy $1 \le i \le n$ by

$$
g_\varphi(\varphi_i) = g_\varphi(\varphi(z_i)) = \langle \gamma, z_i \rangle, \tag{5.57}
$$

where $z_i$ is the feature of policy $i$, which may potentially differ from $x_i$. The
rationale behind this different feature is that different information might be relevant
for modeling the dispersion parameter, or feature information might be differently
pre-processed compared to the response $Y_i$. We now need to estimate two regression
parameters $\beta$ and $\gamma$ in this approach on possibly differently pre-processed feature
information $x_i$ and $z_i$ of policy $i$. In general, this is not easily doable because the
term $a(Y_i; v_i/\varphi_i)$ of the log-likelihood of $Y_i$ may have a complicated structure (or
may not be available in closed form like in Tweedie's CP model).


5.5.2 Saddlepoint Approximation 


We reformulate the EDF density using the unit deviance $\mathfrak{d}(y, \mu)$ defined in (2.25);
we drop the lower index $i$ for the moment. Set $\theta = h(\mu) \in \Theta$ for the canonical link
$h$, then

$$
\begin{aligned}
f(y; \theta, v/\varphi) &= \exp\Big\{ \frac{v}{\varphi} \big[ y h(\mu) - \kappa(h(\mu)) \big] + a(y; v/\varphi) \Big\} \\
&= \exp\Big\{ \frac{v}{\varphi} \big[ y h(y) - \kappa(h(y)) \big] + a(y; v/\varphi) \Big\} \exp\Big\{ -\frac{1}{2\varphi/v}\, \mathfrak{d}(y, \mu) \Big\} \\
&=: a^*(y; \omega) \exp\Big\{ -\frac{1}{2\varphi/v}\, \mathfrak{d}(y, \mu) \Big\},
\end{aligned} \tag{5.58}
$$

with $\omega = v/\varphi \in \mathcal{W}$. This corresponds to (2.27), and it brings the EDF density into
a Gaussian-looking form. A general difficulty is that the term $a^*(y; \omega)$ may have a
complicated structure or may not be given in closed form. Therefore, we consider
its saddlepoint approximation; this is based on Section 3.5 of Jørgensen [203].

Suppose that we are in the absolutely continuous EDF case and that $\kappa$ is steep.
In that case $Y \in \mathcal{M}$, a.s., and the variance function $y \mapsto V(y)$ is well-defined for
all observations $Y = y$, a.s. Based on Daniels [87], Barndorff-Nielsen–Cox [24]
proved the following statement, see Theorem 3.10 in Jørgensen [203]: assume there
exists $\omega_0 \in \mathcal{W}$ such that for all $\omega \ge \omega_0$ the density (5.58) is bounded. Then, the
following saddlepoint approximation is uniform on compact subsets of the support
$\mathfrak{T}$ of $Y$

$$
f(y; \theta, v/\varphi) = \Big( \frac{2\pi\varphi}{v}\, V(y) \Big)^{-1/2} \exp\Big\{ -\frac{1}{2\varphi/v}\, \mathfrak{d}(y, \mu) \Big\} \big( 1 + O(\varphi/v) \big), \tag{5.59}
$$

as $\varphi/v \to 0$. What makes this saddlepoint approximation attractive is that we can
get rid of a complicated function $a^*(y; \omega)$ by a neat approximation $\big( \frac{2\pi\varphi}{v} V(y) \big)^{-1/2}$
for sufficiently large volumes $v$, and at the same time, this does not affect the unit
deviance $\mathfrak{d}(y, \mu)$, preserving the estimation properties of $\mu$. The discrete counterpart
is given in Theorem 3.11 of Jørgensen [203].

Using saddlepoint approximation (5.59) we receive an approximate log-
likelihood function

$$
\ell_Y(\mu, \varphi) \approx \frac{1}{2} \Big[ -\varphi^{-1} v\, \mathfrak{d}(Y, \mu) - \log(\varphi) \Big] - \frac{1}{2} \log\Big( \frac{2\pi}{v} V(Y) \Big).
$$

This approximation has an attractive form for dispersion estimation because it gives
an approximate EDF for observation $\mathbf{d} = v\, \mathfrak{d}(Y, \mu)$, for given $\mu$. Namely, for
canonical parameter $\phi = -\varphi^{-1} < 0$ we have approximation

$$
\ell_Y(\mu, \varphi) \approx \frac{\mathbf{d} \phi - (-\log(-\phi))}{2} - \frac{1}{2} \log\Big( \frac{2\pi}{v} V(Y) \Big). \tag{5.60}
$$

The right-hand side has the structure of a gamma EDF for observation $\mathbf{d}$ with
canonical parameter $\phi < 0$, cumulant function $\kappa_\varphi(\phi) = -\log(-\phi)$ and dispersion
parameter 2. Thus, we have the structure of an approximate gamma model on the
right-hand side of (5.60) with, for given $\mu$,

$$
\mathbb{E}_\phi[\mathbf{d} \,|\, \mu] \approx \kappa_\varphi'(\phi) = -\frac{1}{\phi} = \varphi, \tag{5.61}
$$

$$
\mathrm{Var}_\phi(\mathbf{d} \,|\, \mu) \approx 2 \kappa_\varphi''(\phi) = 2 \frac{1}{\phi^2} = 2\varphi^2. \tag{5.62}
$$

These statements say that for given $\mu$ and assuming that the saddlepoint
approximation is sufficiently accurate, $\mathbf{d}$ is approximately gamma distributed with shape
parameter 1/2 and canonical parameter $\phi$ (which relates to the dispersion $\varphi$ in the
mean parametrization). Thus, we can estimate $\phi$ and $\varphi$, respectively, with a (second)
GLM from (5.60), for given mean parameter $\mu$.
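The moment statements (5.61) and (5.62) can be verified by simulation in the Gaussian case, where the unit deviance is $(y - \mu)^2$ and the observation $v(Y - \mu)^2$ is a scaled chi-square variable with one degree of freedom. The sketch below uses arbitrary illustrative values for the dispersion, the weight and the mean.

```python
import numpy as np

rng = np.random.default_rng(7)
phi, v, mu = 2.5, 4.0, 10.0          # dispersion, weight and mean (illustrative values)
n_sim = 400_000

# Gaussian EDF: Y ~ N(mu, phi / v) with unit deviance (y - mu)^2
Y = rng.normal(mu, np.sqrt(phi / v), size=n_sim)
d = v * (Y - mu) ** 2                # deviance observations for given mean mu

print(d.mean(), phi)                 # (5.61): mean equal to the dispersion
print(d.var(), 2 * phi**2)           # (5.62): variance equal to twice the squared dispersion
```

The same experiment for other EDF members would only match approximately, with an error that decreases in the ratio of weight to dispersion.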


Remarks 5.28 


• The accuracy of the saddlepoint approximation is discussed in Section 3.2 of
Smyth–Verbyla [343]. The saddlepoint approximation is exact in the Gaussian
and the inverse Gaussian case. In the Gaussian case, we have log-likelihood

$$
\ell_Y(\mu, \varphi) = \frac{\mathbf{d} \phi - (-\log(-\phi))}{2} - \frac{1}{2} \log\Big( \frac{2\pi}{v} \Big),
$$

with variance function $V(Y) = 1$. In the inverse Gaussian case, we have log-
likelihood

$$
\ell_Y(\mu, \varphi) = \frac{\mathbf{d} \phi - (-\log(-\phi))}{2} - \frac{1}{2} \log\Big( \frac{2\pi}{v} Y^3 \Big),
$$

with variance function $V(Y) = Y^3$. Thus, in the Gaussian case and in the inverse
Gaussian case we have a gamma model for $\mathbf{d}$ with mean $\varphi$ and shape parameter
1/2, for given $\mu$; for a related result we also refer to Theorem 3 of Blæsild–Jensen
[38]. For Tweedie's models with $p \ge 1$, one can show that the relative error of the
saddlepoint approximation is a non-increasing function of the squared coefficient
of variation $\tau = \frac{\varphi}{v} V(y)/y^2 = \frac{\varphi}{v} y^{p-2}$, leading to small approximation errors if
$\varphi/v$ is sufficiently small; typically one requires $\tau < 1/3$, see Section 3.2 of
Smyth–Verbyla [343].

• The saddlepoint approximation itself does not provide a density because in
general the term $O(\varphi/v)$ in (5.59) is non-zero. Nelder–Pregibon [282] renormalized
the saddlepoint approximation to a proper density and studied its properties.

• In the gamma EDF case, the saddlepoint approximation would not be necessary
because this case can still be solved in closed form. In fact, in the gamma EDF
case we have log-likelihood, set $\phi = -v/\varphi < 0$,

$$
\ell_Y(\mu, \varphi) = \frac{\phi\, \mathfrak{d}(Y, \mu) - \chi(\phi)}{2} - \log Y, \tag{5.63}
$$

with $\chi(\phi) = 2 \big( \log \Gamma(-\phi) + \phi \log(-\phi) - \phi \big)$. For given $\mu$, this is an EDF
for $\mathfrak{d}(Y, \mu)$ with cumulant function $\chi$ on the effective domain $(-\infty, 0)$. This
provides us with expected value and variance

$$
\mathbb{E}_\phi\big[ \mathfrak{d}(Y, \mu) \,\big|\, \mu \big] = \chi'(\phi) = 2 \big( -\psi(-\phi) + \log(-\phi) \big) \approx -\frac{1}{\phi},
$$

$$
\mathrm{Var}_\phi\big( \mathfrak{d}(Y, \mu) \,\big|\, \mu \big) = 2 \chi''(\phi) = 4 \Big( \psi'(-\phi) - \frac{1}{-\phi} \Big),
$$

with digamma function $\psi$ and the approximation exactly refers to the
saddlepoint approximation; for the variance statement we also refer to Fisher's
information (2.30). For receiving more accurate mean approximations one can
consider higher order terms, e.g., the second order approximation is
$\chi'(\phi) \approx -1/\phi + 1/(6\phi^2)$. In fact, from the saddlepoint approximation (5.60) and from
the exact formula (5.63) we receive in the gamma case Stirling's formula

$$
\Gamma(x) \approx \sqrt{2\pi}\, x^{x - 1/2} e^{-x}.
$$

In the subsequent examples we will just use the saddlepoint approximation also
in the gamma EDF case.
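The quality of the first and second order mean approximations in the gamma case can be inspected numerically. The sketch below evaluates the cumulant function $\chi$ with the log-gamma function of the standard library and differentiates it by a central finite difference (an assumption made to avoid a digamma routine); the name `phi_c` for the canonical parameter and the evaluated values of $v/\varphi$ are illustrative.

```python
import math

def chi(phi_c):
    """chi(phi_c) = 2 * (log Gamma(-phi_c) + phi_c * log(-phi_c) - phi_c) for phi_c < 0."""
    return 2.0 * (math.lgamma(-phi_c) + phi_c * math.log(-phi_c) - phi_c)

def chi_prime(phi_c, h=1e-5):
    """Central finite difference approximation of the derivative of chi."""
    return (chi(phi_c + h) - chi(phi_c - h)) / (2.0 * h)

results = {}
for omega in (1.0, 3.0, 10.0, 100.0):            # omega = v / dispersion
    phi_c = -omega                               # canonical parameter of (5.63)
    exact = chi_prime(phi_c)                     # expected unit deviance
    first = -1.0 / phi_c                         # saddlepoint (first order) approximation
    second = -1.0 / phi_c + 1.0 / (6.0 * phi_c**2)
    results[omega] = (exact, first, second)
    print(omega, exact, first, second)
```

The first order approximation understates the expected unit deviance, and the gap closes quickly as $v/\varphi$ grows, which is in line with the accuracy discussion of the first bullet point.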


5.5.3 Residual Maximum Likelihood Estimation


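The saddlepoint approximation (5.60) proposes to alternate MLE of $\beta$ for the mean
model (5.56) and of $\gamma$ for the dispersion model (5.57). Fisher's information matrix
of the saddlepoint approximation (5.60) w.r.t. the canonical parameters $\theta$ and $\phi$ is
given by

$$
\mathcal{I}(\theta, \phi) = -\mathbb{E}_{\theta, \phi} \begin{pmatrix} \phi v \kappa''(\theta) & -v \big( Y - \kappa'(\theta) \big) \\ -v \big( Y - \kappa'(\theta) \big) & -\frac{1}{2} \frac{1}{\phi^2} \end{pmatrix} = \begin{pmatrix} \frac{v}{\varphi} V(\mu) & 0 \\ 0 & \frac{1}{2} V_\varphi(\varphi) \end{pmatrix},
$$

with variance function $V_\varphi(\varphi) = \varphi^2$, and emphasizing that we work in the canonical
parametrization $(\theta, \phi)$. This is a positive definite diagonal matrix which suggests
that the algorithm alternating the $\beta$ and $\gamma$ estimations will have a fast convergence.
For fixed estimate $\widehat{\gamma}$ we calculate estimated dispersion parameters $\widehat{\varphi}_i = g_\varphi^{-1} \langle \widehat{\gamma}, z_i \rangle$
of policies $1 \le i \le n$, see (5.57). These then allow us to calculate diagonal working
weight matrix

$$
W(\beta) = \mathrm{diag}\Big( \Big( \frac{\partial g(\mu_i)}{\partial \mu_i} \Big)^{-2} \frac{v_i}{\widehat{\varphi}_i} \frac{1}{V(\mu_i)} \Big)_{1 \le i \le n} \in \mathbb{R}^{n \times n},
$$

which is used in Fisher's scoring method/IRLS algorithm (5.12) to receive MLE $\widehat{\beta}$,
given the estimates $(\widehat{\varphi}_i)_i$. These MLEs allow us to estimate the mean parameters
$\widehat{\mu}_i = g^{-1} \langle \widehat{\beta}, x_i \rangle$, and to calculate the deviances

$$
\mathbf{d}_i = v_i\, \mathfrak{d}(Y_i, \widehat{\mu}_i) = 2 v_i \Big( Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\widehat{\mu}_i) + \kappa(h(\widehat{\mu}_i)) \Big) \ge 0.
$$

Using (5.60) we know that these deviances can be approximated by gamma
distributions $\Gamma(1/2, 1/(2\varphi_i))$. This is a single-parameter EDF with dispersion
parameter 2 (as nuisance parameter) and mean parameter $\varphi_i$. This motivates the
definition of the working weight matrix (based on the gamma EDF model)

$$
W_\varphi(\gamma) = \mathrm{diag}\Big( \Big( \frac{\partial g_\varphi(\varphi_i)}{\partial \varphi_i} \Big)^{-2} \frac{1}{2} \frac{1}{V_\varphi(\varphi_i)} \Big)_{1 \le i \le n} \in \mathbb{R}^{n \times n},
$$

and the working residuals

$$
R_\varphi(\mathbf{d}, \gamma) = \Big( \frac{\partial g_\varphi(\varphi_i)}{\partial \varphi_i} (\mathbf{d}_i - \varphi_i) \Big)^\top_{1 \le i \le n} \in \mathbb{R}^n.
$$

Fisher's scoring method (5.12) iterates for $s \ge 0$ the following recursion to receive $\widehat{\gamma}$

$$
\widehat{\gamma}^{(s)} \mapsto \widehat{\gamma}^{(s+1)} = \Big( \mathfrak{Z}^\top W_\varphi(\widehat{\gamma}^{(s)}) \mathfrak{Z} \Big)^{-1} \mathfrak{Z}^\top W_\varphi(\widehat{\gamma}^{(s)}) \Big( \mathfrak{Z} \widehat{\gamma}^{(s)} + R_\varphi(\mathbf{d}, \widehat{\gamma}^{(s)}) \Big), \tag{5.64}
$$

where $\mathfrak{Z} = (z_1, \ldots, z_n)^\top$ is the design matrix used to estimate the dispersion
parameters.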
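For the log-link $g_\varphi = \log$ the recursion (5.64) simplifies considerably: the derivative of the link is $1/\varphi_i$, so the working weights are constant equal to 1/2 and the working residuals are $(\mathbf{d}_i - \varphi_i)/\varphi_i$. The sketch below implements this special case and tests it on simulated deviances drawn from the gamma model with shape parameter 1/2 and mean $\varphi_i$; the function name, the simulated design and the starting value are assumptions for illustration, and the deviances are taken as given rather than computed from a fitted mean GLM.

```python
import numpy as np

def dispersion_irls_log_link(Zmat, d, gamma0, n_iter=100, tol=1e-10):
    """Recursion (5.64) for the log-link g_phi = log and V_phi(phi) = phi^2."""
    gamma = np.asarray(gamma0, dtype=float)
    for _ in range(n_iter):
        phi = np.exp(Zmat @ gamma)
        W = np.full(len(d), 0.5)                 # (1/phi)^{-2} * (1/2) * (1/phi^2) = 1/2
        R = (d - phi) / phi                      # working residuals
        lhs = Zmat.T @ (W[:, None] * Zmat)
        rhs = Zmat.T @ (W * (Zmat @ gamma + R))
        gamma_new = np.linalg.solve(lhs, rhs)
        converged = np.max(np.abs(gamma_new - gamma)) < tol
        gamma = gamma_new
        if converged:
            break
    return gamma

# Simulated deviances: gamma distributed with shape 1/2, mean phi_i and variance 2 * phi_i^2
rng = np.random.default_rng(11)
n = 20_000
Zmat = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
gamma_true = np.array([0.5, 0.8])
phi_true = np.exp(Zmat @ gamma_true)
d = rng.gamma(shape=0.5, scale=2.0 * phi_true)

gamma_hat = dispersion_irls_log_link(Zmat, d, gamma0=np.array([np.log(d.mean()), 0.0]))
print(gamma_hat)
```

In a double GLM this step alternates with a weighted fit of the mean model, where the weights $v_i/\widehat{\varphi}_i$ are updated from the current dispersion estimates.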


5.5.4 Lab: Double GLM Algorithm for Gamma Claim Sizes 


We revisit the Swedish motorcycle claim size data studied in Sect. 5.3.7. We expand 
the gamma claim size GLM to a double GLM also modeling the systematic effects 
in the dispersion parameter. In a first step we need to change the parametrization of 
the gamma model of Sect. 5.3.7. In the former section we have modeled the average
claim size $S_i/n_i \sim \Gamma(n_i \alpha_i, n_i c_i)$, but for applying the saddlepoint approximation
we should use the reproductive form (5.44) of the gamma model. We therefore set

$$
Y_i = S_i/(n_i \alpha_i) \sim \Gamma(n_i \alpha_i, n_i \alpha_i c_i). \tag{5.65}
$$

The reason for the different parametrization in Sect. 5.3.7 has been that (5.65) is not
directly useful if $\alpha_i$ is unknown because in that case the observations $Y_i$ cannot be
calculated. In this section we estimate $\varphi_i = 1/\alpha_i$ which allows us to model (5.65);
a different treatment within Tweedie's family is presented in Sect. 11.1.3. The only
difficulty is to initialize the double GLM algorithm. We proceed as follows.


(0) In an initial step we assume constant dispersion $\varphi_i = 1/\alpha_i = 1/\alpha = 1$. This
gives us exactly the mean estimates of Sect. 5.3.7 for $S_i/n_i \sim \Gamma(n_i \alpha, n_i c_i)$;
note that for constant shape parameter $\alpha$ the mean of $S_i/n_i$ can be estimated
without explicit knowledge of $\alpha$ (because it cancels in the score equations).
Using these mean estimates we calculate the MLE $\widehat{\alpha}^{(0)}$ of the (constant) shape
parameter $\alpha$, see Remark 5.26. This then allows us to determine the (scaled)
observations $Y_i^{(1)} = S_i/(n_i \widehat{\alpha}^{(0)})$ and we initialize $\widehat{\varphi}_i^{(0)} = 1/\widehat{\alpha}^{(0)}$.

(1) Iterate for $t \ge 1$:

– estimate the mean $\mu_i$ of $Y_i$ using the mean GLM (5.56) based on the
observations $Y_i^{(t)}$ and the dispersion estimates $\widehat{\varphi}_i^{(t-1)}$. This provides us with
$\widehat{\mu}_i^{(t)}$;

– based on the deviances $\mathbf{d}_i^{(t)} = v_i\, \mathfrak{d}(Y_i^{(t)}, \widehat{\mu}_i^{(t)})$, calculate the updated
dispersion estimates $\widehat{\varphi}_i^{(t)}$ using the dispersion GLM (5.57) and the residual
MLE iteration (5.64) with the saddlepoint approximation. Set for the updated
observations $Y_i^{(t+1)} = S_i \widehat{\varphi}_i^{(t)}/n_i$.

Table 5.15 Number of parameters, AICs, Pearson's dispersion estimate, in-sample losses, tenfold
cross-validation losses and the in-sample average claim amounts of the null model (gamma
intercept model) and the (double) gamma GLM

                  | # Param. | AIC    | Dispersion est. (Pearson) | In-sample loss on L | Tenfold CV loss D^CV | Average amount
Gamma null        | 1+1      | 14'416 | 2.057                     | 2.085               | 2.091                | 24'641
Gamma GLM2        | 7+1      | 14'274 | 1.544                     | 1.719               | 1.747                | 25'130
Double gamma GLM  | 7+6      | 14'258 | –                         | (1.721)             | –                    | 26'413


In an initial double GLM analysis we use the feature information $z_i = x_i$ for the
dispersion $\varphi_i$ modeling (5.57). We choose for both GLMs the log-link which leads to
concave maximization problems, see Example 5.5. Running the above double GLM
algorithm converges in 4 iterations, and analyzing the resulting model we observe
that we should drop the variable RiskClass from the feature $z_i$. We then run the
same double GLM algorithm with the feature information $x_i$ and the new $z_i$ again,
and the results are presented in Table 5.15.

The considered double GLM has parameter dimensions $\beta \in \mathbb{R}^7$ and $\gamma \in \mathbb{R}^6$. To
have comparability with AIC of Sect. 5.3.7, we evaluate AIC of the double GLM
in the observations $S_i/n_i$ (and not in $Y_i$; i.e., similar to the gamma GLM). We
observe that it has an improved AIC value compared to model Gamma GLM2.
Thus, indeed, dispersion modeling seems necessary in this example (under the
GLM2 regression structure). We do not calculate in-sample and cross-validation
losses in the double GLM because in the other two models of Table 5.15 we have
set $\varphi = 1$ in these statistics. However, the in-sample loss of model Gamma GLM2
with $\varphi = 1$ corresponds to the (homogeneous) deviance dispersion estimate (up to
scaling $n/(n - (q + 1))$), and this in-sample loss of 1.719 can directly be compared
to the average estimated dispersion $m^{-1} \sum_{i=1}^m \widehat{\varphi}_i = 1.721$ (in round brackets in
Table 5.15). On the downside, the double GLM has a bigger bias which needs an
adjustment.

In Fig. 5.13 (lhs) we give the normal plots of model Gamma GLM2 and the
double gamma GLM model. This plot is obtained by transforming the observations
to normal quantiles using the corresponding estimated gamma models. We see
quite some similarity between the two estimated gamma models. Both models
seem to have similar deficiencies, i.e., dispersion modeling improves the explanation
of the observations, however, either the regression function or the gamma distributional
assumption does not fully fit the data, especially for small claims. Finally, in
Fig. 5.13 (rhs) we plot the estimated dispersion parameters $\hat{\varphi}_i$ against the logged
estimated means $\log(\hat{\mu}_i)$ (linear predictors). We observe that the estimated
dispersion has a (weak) U-shape as a function of the expected claim sizes, which indicates
that the tails cannot fully be captured by our model. This closes this example.


Fig. 5.13 (lhs) Normal plot of the fitted models Gamma GLM2 and double GLM (theoretical values
against observed values), (rhs) estimated dispersion parameters $\hat{\varphi}_i$ against the logged
estimated means $\log(\hat{\mu}_i)$ (the orange line gives the in-sample loss in model Gamma GLM2)


Remark 5.29 For the dispersion estimation $\hat{\varphi}_i$ we use as observations the deviances
$d_i = v_i \mathfrak{d}(Y_i, \hat{\mu}_i)$, $1 \le i \le n$. On a finite sample, these deviances are typically
biased due to the use of the estimated means $\hat{\mu}_i$. Smyth–Verbyla [343] propose the
following bias correction. Consider the estimated hat matrix defined by

\[
H = W_{(\hat{\beta},\hat{\gamma})}^{1/2}\, \mathfrak{X} \left(\mathfrak{X}^\top W_{(\hat{\beta},\hat{\gamma})} \mathfrak{X}\right)^{-1} \mathfrak{X}^\top W_{(\hat{\beta},\hat{\gamma})}^{1/2},
\]

with the diagonal working weight matrix $W_{(\hat{\beta},\hat{\gamma})}$ depending on the estimated
regression parameters $\hat{\beta}$ and $\hat{\gamma}$ through $\hat{\mu}$ and $\hat{\varphi}$. Denote the diagonal entries of
the hat matrix by $(h_{i,i})_{1 \le i \le n}$. A bias corrected version of the deviances is obtained
by considering the observations $(1 - h_{i,i})^{-1} d_i = (1 - h_{i,i})^{-1} v_i \mathfrak{d}(Y_i, \hat{\mu}_i)$, $1 \le i \le n$.
We will come back to the hat matrix $H$ in Sect. 5.6.1, below.


5.5.5  Tweedie’s Compound Poisson GLM 


A popular situation for applying the double GLM framework is Tweedie's CP
model introduced in Sect. 2.2.3, in particular, we refer to Proposition 2.17 for the
corresponding parametrization. Having claim frequency and claim sizes involved,
such a model can hardly be calibrated with one single regression function and a
constant dispersion parameter. An obvious choice is a double GLM; this is the
proposal presented in Smyth–Jørgensen [342]. In most cases one chooses the log-link for
both link functions $g$ and $g_\varphi$ because positivity needs to be guaranteed.
This implies for the two working weight matrices of the double GLM

\[
W(\beta) = \mathrm{diag}\left(\mu_i^2\, \frac{v_i}{\varphi_i}\, \frac{1}{V(\mu_i)}\right)_{1 \le i \le n}
= \mathrm{diag}\left(\frac{v_i\, \mu_i^{2-p}}{\varphi_i}\right)_{1 \le i \le n},
\]

\[
W_\varphi(\gamma) = \mathrm{diag}\left(\varphi_i^2\, \frac{1}{2}\, \frac{1}{V_\varphi(\varphi_i)}\right)_{1 \le i \le n}
= \mathrm{diag}(1/2, \ldots, 1/2).
\]

The deviances in Tweedie's CP model are given by, see (4.18),

\[
d_i = v_i \mathfrak{d}(Y_i, \hat{\mu}_i)
= 2 v_i \left( Y_i\, \frac{Y_i^{1-p} - \hat{\mu}_i^{1-p}}{1-p} - \frac{Y_i^{2-p} - \hat{\mu}_i^{2-p}}{2-p} \right) \ge 0,
\]

and these deviances could still be de-biased, see Remark 5.29. The working
responses for the two GLMs are

\[
R = \left(Y_i/\hat{\mu}_i - 1\right)^\top_{1 \le i \le n}
\qquad \text{and} \qquad
R_\varphi = \left(d_i/\hat{\varphi}_i - 1\right)^\top_{1 \le i \le n}.
\]
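The deviances and working responses above can be evaluated directly. The following is a minimal sketch in Python (standard library only) for a power parameter $1 < p < 2$; the function names and the list-based inputs are our own illustrative choices and are not taken from the text.

```python
def tweedie_unit_deviance(y, mu, p):
    """Unit deviance of Tweedie's CP model for a power parameter 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch only covers 1 < p < 2")
    if mu <= 0.0 or y < 0.0:
        raise ValueError("need mu > 0 and y >= 0")
    # For y = 0 the first term vanishes because y * y^(1-p) = y^(2-p) -> 0.
    term1 = y * (y ** (1.0 - p) - mu ** (1.0 - p)) / (1.0 - p) if y > 0.0 else 0.0
    term2 = (y ** (2.0 - p) - mu ** (2.0 - p)) / (2.0 - p)
    return 2.0 * (term1 - term2)


def working_responses(y, v, mu, phi, p):
    """Deviances d_i and the two working responses under log-links."""
    d = [vi * tweedie_unit_deviance(yi, mi, p) for yi, vi, mi in zip(y, v, mu)]
    r_mean = [yi / mi - 1.0 for yi, mi in zip(y, mu)]        # mean GLM
    r_disp = [di / fi - 1.0 for di, fi in zip(d, phi)]       # dispersion GLM
    return d, r_mean, r_disp
```

Note that the unit deviance is zero if and only if the observation coincides with the mean, and that an observation $Y_i = 0$ (no claim) still contributes a strictly positive deviance.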


The drawback of this approach is that it only considers the (scaled) total claim
amounts $Y_i = S_i \varphi_i / v_i$ as observations, see Proposition 2.17. These total claim
amounts consist of the number of claims $N_i$ and i.i.d. individual claim sizes
$Z_{i,j} \sim \Gamma(\alpha, c_i)$, supposing $N_i \ge 1$. Having observations of both claim amounts
$S_i$ and claim counts $N_i$ allows one to build a Poisson GLM for claim counts and
a gamma GLM for claim sizes which can be estimated separately. This has also
been the motivation for Smyth–Jørgensen [342] to enhance Tweedie's model estimation
for known claim counts in their Section 4. Moreover, in Theorem 4 of Delong et
al. [94] it is proved that the two GLM approaches can be identified under log-link
choices.


5.6 Diagnostic Tools 


In our examples we have studied several figures like AIC, cross-validation losses, 
etc., for model and parameter selection. Moreover, we have plotted the results, for 
instance, using the Tukey—Anscombe plot or the QQ plot. Of course, there are 
numerous other plots and tools that can help us to analyze the results and to improve 
the resulting models. We present some of these in this section. 


5.6.1 The Hat Matrix 


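The MLE $\hat{\beta}^{\rm MLE}$ satisfies at convergence of the IRLS algorithm, see (5.12),

\[
\hat{\beta}^{\rm MLE} = \left(\mathfrak{X}^\top W(\hat{\beta}^{\rm MLE}) \mathfrak{X}\right)^{-1} \mathfrak{X}^\top W(\hat{\beta}^{\rm MLE}) \left(\mathfrak{X}\hat{\beta}^{\rm MLE} + R(Y, \hat{\beta}^{\rm MLE})\right),
\]

with working residuals for $\beta \in \mathbb{R}^{q+1}$

\[
R(Y, \beta) = \left( \left.\frac{\partial g(\mu_i)}{\partial \mu_i}\right|_{\mu_i = \mu_i(\beta)} \left(Y_i - \mu_i(\beta)\right) \right)^\top_{1 \le i \le n} \in \mathbb{R}^n.
\]

Following Section 4.2.2 of Fahrmeir–Tutz [123], this allows us to define the so-called
hat matrix, see also Remark 5.29,

\[
H = H(\hat{\beta}^{\rm MLE}) = W(\hat{\beta}^{\rm MLE})^{1/2}\, \mathfrak{X} \left(\mathfrak{X}^\top W(\hat{\beta}^{\rm MLE}) \mathfrak{X}\right)^{-1} \mathfrak{X}^\top W(\hat{\beta}^{\rm MLE})^{1/2} \in \mathbb{R}^{n \times n},
\tag{5.66}
\]

recall that the working weight matrix $W(\beta)$ is diagonal. The hat matrix $H$ is
symmetric and idempotent, i.e. $H^2 = H$, with $\mathrm{trace}(H) = \mathrm{rank}(H) = q + 1$.
Therefore, $H$ acts as a projection, mapping the observations $\widetilde{Y}$ to the fitted values

\[
\widetilde{Y} := W(\hat{\beta}^{\rm MLE})^{1/2} \left(\mathfrak{X}\hat{\beta}^{\rm MLE} + R(Y, \hat{\beta}^{\rm MLE})\right)
\ \mapsto\ H\widetilde{Y} = W(\hat{\beta}^{\rm MLE})^{1/2}\, \mathfrak{X}\hat{\beta}^{\rm MLE} = W(\hat{\beta}^{\rm MLE})^{1/2}\, \hat{\eta},
\]

where $\hat{\eta}$ denotes the fitted linear predictors. The diagonal elements $h_{i,i}$ of this hat
matrix $H$ satisfy $0 \le h_{i,i} \le 1$, and values close to 1 correspond to extreme data
points $i$, in particular, for $h_{i,i} = 1$ only observation $\widetilde{Y}_i$ influences $\hat{\eta}_i$, whereas for
$h_{i,i} = 0$ observation $\widetilde{Y}_i$ has no influence on $\hat{\eta}_i$.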

Figure 5.14 gives the diagonal entries of the resulting hat matrices of the double gamma GLM of
Sect. 5.5.4. On the left-hand side we show the diagonal entries $h_{i,i}$ for the claim
amount responses $Y_i$ (for the estimation of $\mu_i$), and on the right-hand side the
corresponding plot for the deviance responses $d_i$ (for the estimation of $\varphi_i$). These
diagonal elements $h_{i,i}$ are ordered on the x-axis w.r.t. the linear predictors $\hat{\eta}_i$. From
this figure we conclude that the diagonal entries of the hat matrices are bigger for
very small responses in our example, and the dispersion plot has a few more
special observations that may require further analysis.


Fig. 5.14 Diagonal entries $h_{i,i}$ of the two hat matrices of the example in Sect. 5.5.4, plotted
against the logged (linear) predictor: (lhs) for means $\hat{\mu}_i$ and responses $Y_i$, and (rhs) for
dispersions $\hat{\varphi}_i$ and responses $d_i$


5.6.2 Case Deletion and Generalized Cross-Validation


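As a continuation of the previous subsection we can analyze the influence of
an individual observation $Y_i$ on the estimation of the regression parameter $\beta$. This
influence is naturally measured by fitting the regression parameter once based on the
full data $\mathcal{D}$ and once based only on the observations $\mathcal{L}_{(-i)} = \mathcal{D} \setminus \{Y_i\}$; we also refer
to leave-one-out cross-validation in Sect. 4.2.2. The influence of observation $Y_i$ is
then obtained by comparing $\hat{\beta}^{\rm MLE}$ and $\hat{\beta}^{\rm MLE}_{(-i)}$. Since fitting $n$ different models by
individually leaving out each observation $Y_i$ is too costly, one only explores a one-step
Fisher's scoring update starting from $\hat{\beta}^{\rm MLE}$ that provides an approximation to
$\hat{\beta}^{\rm MLE}_{(-i)}$, that is,

\[
\begin{aligned}
\hat{\beta}^{(1)}_{(-i)}
&= \left(\mathfrak{X}_{(-i)}^\top W_{(-i)}(\hat{\beta}^{\rm MLE}) \mathfrak{X}_{(-i)}\right)^{-1} \mathfrak{X}_{(-i)}^\top W_{(-i)}(\hat{\beta}^{\rm MLE}) \left(\mathfrak{X}_{(-i)}\hat{\beta}^{\rm MLE} + R_{(-i)}(Y_{(-i)}, \hat{\beta}^{\rm MLE})\right) \\
&= \left(\mathfrak{X}_{(-i)}^\top W_{(-i)}(\hat{\beta}^{\rm MLE}) \mathfrak{X}_{(-i)}\right)^{-1} \mathfrak{X}_{(-i)}^\top W_{(-i)}(\hat{\beta}^{\rm MLE})^{1/2}\, \widetilde{Y}_{(-i)},
\end{aligned}
\]

where all lower indices $(-i)$ indicate that we drop the corresponding row and/or
column from the matrices and vectors, and $\widetilde{Y}$ has been defined in the previous
subsection. This allows us to compare $\hat{\beta}^{\rm MLE}$ and $\hat{\beta}^{(1)}_{(-i)}$ to analyze the influence of
observation $Y_i$.

To reformulate this approximation, we come back to the hat matrix $H =
H(\hat{\beta}^{\rm MLE}) = (h_{i,j})_{1 \le i,j \le n}$ defined in (5.66). It fulfills

\[
W(\hat{\beta}^{\rm MLE})^{1/2}\, \mathfrak{X}\hat{\beta}^{\rm MLE} = H\widetilde{Y}
= \left( \sum_{j=1}^n h_{1,j} \widetilde{Y}_j, \ldots, \sum_{j=1}^n h_{n,j} \widetilde{Y}_j \right)^\top \in \mathbb{R}^n.
\]

Thus, for predicting $Y_i$ we can consider the linear predictor (for the chosen link $g$)

\[
\hat{\eta}_i = g(\hat{\mu}_i) = \langle \hat{\beta}^{\rm MLE}, x_i \rangle = \left(\mathfrak{X}\hat{\beta}^{\rm MLE}\right)_i
= W_{i,i}(\hat{\beta}^{\rm MLE})^{-1/2} \sum_{j=1}^n h_{i,j} \widetilde{Y}_j.
\]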




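A computation of the linear predictor of $Y_i$ using the leave-one-out approximation
$\hat{\beta}^{(1)}_{(-i)}$ gives

\[
\hat{\eta}_i^{(-i,1)} = \langle \hat{\beta}^{(1)}_{(-i)}, x_i \rangle
= \frac{1}{1 - h_{i,i}}\, \hat{\eta}_i - \frac{h_{i,i}}{1 - h_{i,i}}\, W_{i,i}(\hat{\beta}^{\rm MLE})^{-1/2}\, \widetilde{Y}_i.
\]

This allows one to efficiently calculate a leave-one-out prediction using the hat
matrix $H$. This also motivates the study of the generalized cross-validation (GCV) loss,
which is an approximation to leave-one-out cross-validation, see Sect. 4.2.2,

\[
\begin{aligned}
\widehat{\mathfrak{D}}^{\rm GCV}
&= \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\varphi}\, \mathfrak{d}\left(Y_i, g^{-1}(\hat{\eta}_i^{(-i,1)})\right) \qquad (5.67) \\
&= \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi} \left[ Y_i h(Y_i) - \kappa\left(h(Y_i)\right) - Y_i\, h\left(g^{-1}(\hat{\eta}_i^{(-i,1)})\right) + \kappa\left(h\left(g^{-1}(\hat{\eta}_i^{(-i,1)})\right)\right) \right].
\end{aligned}
\]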


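Example 5.30 (Generalized Cross-Validation Loss in the Gaussian Case) We study
the generalized cross-validation loss $\widehat{\mathfrak{D}}^{\rm GCV}$ in the homoskedastic Gaussian case
$v_i/\varphi = 1/\sigma^2$ with cumulant function $\kappa(\theta) = \theta^2/2$ and canonical link $g(\mu) =
h(\mu) = \mu$. The generalized cross-validation loss in the Gaussian case is given by

\[
\widehat{\mathfrak{D}}^{\rm GCV} = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2} \left(Y_i - \hat{\eta}_i^{(-i,1)}\right)^2,
\]

with (linear) leave-one-out predictor

\[
\hat{\eta}_i^{(-i,1)} = \langle \hat{\beta}^{(1)}_{(-i)}, x_i \rangle
= \sum_{j=1, j \neq i}^n \frac{h_{i,j}}{1 - h_{i,i}}\, Y_j
= \frac{1}{1 - h_{i,i}}\, \hat{\eta}_i - \frac{h_{i,i}}{1 - h_{i,i}}\, Y_i.
\]

This gives us the generalized cross-validation loss in the Gaussian case

\[
\widehat{\mathfrak{D}}^{\rm GCV} = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2} \left( \frac{Y_i - \hat{\eta}_i}{1 - h_{i,i}} \right)^2,
\]

with $\beta$-independent hat matrix

\[
H = \mathfrak{X} \left(\mathfrak{X}^\top \mathfrak{X}\right)^{-1} \mathfrak{X}^\top.
\]

The generalized cross-validation loss is used, for instance, for generalized additive
model (GAM) fitting where an efficient and fast cross-validation method is
required to select regularization parameters. Generalized cross-validation has been
introduced by Craven–Wahba [84] but these authors replaced $h_{i,i}$ by $\sum_{j=1}^n h_{j,j}/n$.
It holds that $\sum_{j=1}^n h_{j,j} = \mathrm{trace}(H) = q + 1$, thus, using this approximation we
receive

\[
\widehat{\mathfrak{D}}^{\rm GCV}
\approx \frac{1}{n} \sum_{i=1}^n \frac{1}{\sigma^2} \left( \frac{Y_i - \hat{\eta}_i}{1 - \sum_{j=1}^n h_{j,j}/n} \right)^2
= \left( \frac{n}{n - (q+1)} \right)^2 \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - \hat{\eta}_i)^2}{\sigma^2}
= \frac{n}{n - (q+1)}\, \frac{\hat{\varphi}^{\rm P}}{\sigma^2},
\]

with $\hat{\varphi}^{\rm P}$ being Pearson's dispersion estimate in the Gaussian model, see (5.30). ■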
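The Gaussian identities of Example 5.30 can be checked numerically. The sketch below (plain Python, our own illustrative helper names, one feature plus intercept so that $q + 1 = 2$, and $\sigma^2 = 1$ assumed for the loss) computes the diagonal of the hat matrix in closed form and compares the leave-one-out predictor obtained from $h_{i,i}$ with a brute-force refit that drops observation $i$.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for design rows (1, x_i)."""
    n = len(x)
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det


def hat_diagonal(x):
    """Diagonal h_ii of H = X (X^T X)^{-1} X^T for design rows (1, x_i)."""
    n = len(x)
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    det = n * sxx - sx * sx
    return [(sxx - 2.0 * xi * sx + n * xi * xi) / det for xi in x]


def loo_shortcut(x, y):
    """Leave-one-out predictors (eta_i - h_ii * Y_i) / (1 - h_ii)."""
    b0, b1 = ols_fit(x, y)
    return [(b0 + b1 * xi - hi * yi) / (1.0 - hi)
            for xi, yi, hi in zip(x, y, hat_diagonal(x))]


def loo_refit(x, y):
    """Leave-one-out predictors by refitting without observation i."""
    preds = []
    for i in range(len(x)):
        b0, b1 = ols_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(b0 + b1 * x[i])
    return preds


def gcv_loss(x, y):
    """Average of ((Y_i - eta_i) / (1 - h_ii))^2, assuming sigma^2 = 1."""
    b0, b1 = ols_fit(x, y)
    return sum(((yi - b0 - b1 * xi) / (1.0 - hi)) ** 2
               for xi, yi, hi in zip(x, y, hat_diagonal(x))) / len(x)
```

In the Gaussian case with identity link the working weights do not depend on $\beta$, so the one-step update is exact and both leave-one-out computations agree up to rounding; for other EDF members the shortcut is only an approximation.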


We give a numerical example based on the gamma GLM for the claim sizes 
studied in Sect. 5.3.7. 


Example 5.31 (Leave-One-Out Cross-Validation) The aim of this example is to
compare the generalized cross-validation loss $\widehat{\mathfrak{D}}^{\rm GCV}$ to the leave-one-out
cross-validation loss $\widehat{\mathfrak{D}}^{\rm loo}$, see (4.34), the former being an approximation to the latter.
We do this for the gamma claim size model studied in Sect. 5.3.7. In this example
it is feasible to exactly calculate the leave-one-out cross-validation loss because we
have only 656 claims.

The results are presented in Table 5.16. Firstly, the different cross-validation
losses confirm that the model slightly (in-sample) over-fits to the data, which is
not a surprise when estimating 7 regression parameters based on 656 observations.
Secondly, the cross-validation losses provide similar numbers, with leave-one-out
being slightly bigger than tenfold cross-validation here. Thirdly, the generalized
cross-validation loss $\widehat{\mathfrak{D}}^{\rm GCV}$ manages to approximate the leave-one-out
cross-validation loss $\widehat{\mathfrak{D}}^{\rm loo}$ very well in this example.

Table 5.17 gives the corresponding results for model Poisson GLM1 of
Sect. 5.2.4. Firstly, in this example with 610'206 observations it is not feasible
to calculate the leave-one-out cross-validation loss (for computational reasons).
Therefore, we rely on the generalized cross-validation loss as an approximation.
From the results of Table 5.17 it seems that this approximation (rather)
under-estimates the loss (compared to tenfold cross-validation). Indeed, this is an
observation that we have made also in other examples. ■


Table 5.16 Comparison of different cross-validation losses for model Gamma GLM2

                                                                       | Gamma GLM2
In-sample loss $\mathfrak{D}(\mathcal{L}, \hat{\mu}^{\rm MLE})$        | 1.719
Tenfold CV loss $\widehat{\mathfrak{D}}^{\rm CV}$                      | 1.747
Leave-one-out CV loss $\widehat{\mathfrak{D}}^{\rm loo}$               | 1.756
Generalized CV loss $\widehat{\mathfrak{D}}^{\rm GCV}$                 | 1.758


Table 5.17 Comparison of different cross-validation losses for model Poisson GLM1

                                                                       | Poisson GLM1
In-sample loss $\mathfrak{D}(\mathcal{L}, \hat{\mu}^{\rm MLE})$        | 24.101
Tenfold CV loss $\widehat{\mathfrak{D}}^{\rm CV}$                      | 24.121
Leave-one-out CV loss $\widehat{\mathfrak{D}}^{\rm loo}$               | N/A
Generalized CV loss $\widehat{\mathfrak{D}}^{\rm GCV}$                 | 24.105


5.7 Generalized Linear Models with Categorical Responses 


The reader will have noticed that the discussion of GLMs in this chapter has 
been focusing on the single-parameter linear EDF case (5.1). In many actuarial 
applications we also want to study examples of the vector-valued parameter 
EF (2.2). We briefly discuss the categorical case since this case is frequently used. 


5.7.1 Logistic Categorical Generalized Linear Model 


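We recall the EF representation of the categorical distribution studied in Sect. 2.1.4.
We choose as $\nu$ the counting measure on the finite set $\mathcal{Y} = \{1, \ldots, k+1\}$. A random
variable $Y$ taking values in $\mathcal{Y}$ is called categorical, and the levels $y \in \mathcal{Y}$ can either
be ordinal or nominal. This motivates dummy coding of the categorical random
variable $Y$ providing

\[
T(Y) = \left(\mathbb{1}_{\{Y=1\}}, \ldots, \mathbb{1}_{\{Y=k\}}\right)^\top \in \{0, 1\}^k,
\tag{5.68}
\]

thus, $k + 1$ has been chosen as reference level. For the canonical parameter
$\theta = (\theta_1, \ldots, \theta_k)^\top \in \Theta = \mathbb{R}^k$ we have cumulant function and mean functional,
respectively,

\[
\kappa(\theta) = \log\left(1 + \sum_{j=1}^k e^{\theta_j}\right),
\qquad
p = \mathbb{E}_\theta[T(Y)] = \nabla_\theta \kappa(\theta) = \frac{e^{\theta}}{1 + \sum_{j=1}^k e^{\theta_j}}.
\]

With these choices we receive the EF representation of the categorical distribution
(set $\theta_{k+1} = 0$)

\[
dF(y; \theta) = \exp\left\{ \theta^\top T(y) - \log\left(1 + \sum_{j=1}^k e^{\theta_j}\right) \right\} d\nu(y)
= \prod_{l=1}^{k+1} \left( \frac{e^{\theta_l}}{\sum_{j=1}^{k+1} e^{\theta_j}} \right)^{\mathbb{1}_{\{y=l\}}} d\nu(y).
\]

The covariance matrix of $T(Y)$ is given by

\[
\Sigma(\theta) = \mathrm{Var}_\theta(T(Y)) = \nabla_\theta^2 \kappa(\theta) = \mathrm{diag}(p) - p p^\top \in \mathbb{R}^{k \times k}.
\]

Assume that we have feature information $x \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ for response variable
$Y$. This allows us to lift this categorical model to a GLM. The logistic GLM assumes
for $p = (p_1, \ldots, p_k)^\top \in (0, 1)^k$ a regression function, $1 \le l \le k$,

\[
x \mapsto p_l = p_l(x) = \mathbb{P}_\beta[Y = l] = \frac{\exp\langle \beta_l, x \rangle}{1 + \sum_{j=1}^k \exp\langle \beta_j, x \rangle},
\tag{5.69}
\]

for regression parameter $\beta = (\beta_1^\top, \ldots, \beta_k^\top)^\top \in \mathbb{R}^{k(q+1)}$. Equivalently, we can
rewrite these regression probabilities relative to the reference level, that is, we
consider linear predictors for $1 \le l \le k$

\[
\eta_l(x) = \log\left( \frac{\mathbb{P}_\beta[Y = l]}{\mathbb{P}_\beta[Y = k+1]} \right) = \langle \beta_l, x \rangle.
\tag{5.70}
\]

Note that this naturally gives us the canonical link $h$ which we have already derived
in Sect. 2.1.4. Define the matrix for feature $x \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$

\[
X = \begin{pmatrix}
x^\top & 0 & 0 & \cdots & 0 \\
0 & x^\top & 0 & \cdots & 0 \\
0 & 0 & x^\top & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & x^\top
\end{pmatrix} \in \mathbb{R}^{k \times k(q+1)}.
\tag{5.71}
\]

This gives linear predictor and canonical parameter, respectively, under the canonical
link $h$

\[
\theta = h(p(x)) = \eta(x) = X\beta = \left( \langle \beta_1, x \rangle, \ldots, \langle \beta_k, x \rangle \right)^\top \in \Theta = \mathbb{R}^k.
\tag{5.72}
\]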


5.7.2 Maximum Likelihood Estimation in Categorical Models 


Assume we have $n$ independent observations $Y_i$ following the logistic categorical
GLM (5.69) with features $x_i \in \mathbb{R}^{q+1}$ and $X_i \in \mathbb{R}^{k \times k(q+1)}$, respectively, for $1 \le
i \le n$. The joint log-likelihood function is given by, we use (5.72),

\[
\beta \mapsto \ell_Y(\beta) = \sum_{i=1}^n (X_i \beta)^\top T(Y_i) - \kappa(X_i \beta).
\]

This provides us with score equations

\[
s(\beta, Y) = \nabla_\beta \ell_Y(\beta)
= \sum_{i=1}^n X_i^\top \left[ T(Y_i) - \nabla_\theta \kappa(X_i \beta) \right]
= \sum_{i=1}^n X_i^\top \left[ T(Y_i) - p(x_i) \right] = 0,
\]

with logistic regression function (5.69) for $p(x)$. For the score equations with
canonical link we also refer to the second case in Proposition 5.1. Next, we calculate
Fisher's information matrix, we also refer to (3.16),

\[
\mathcal{I}_n(\beta) = -\mathbb{E}_\beta\left[ \nabla_\beta^2 \ell_Y(\beta) \right] = \sum_{i=1}^n X_i^\top \Sigma_i(\beta) X_i,
\]

with covariance matrix of $T(Y_i)$

\[
\Sigma_i(\beta) = \nabla_\theta^2 \kappa(X_i \beta) = \mathrm{diag}\left(p(x_i)\right) - p(x_i) p(x_i)^\top.
\]

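We rewrite the score in a similar way as in Sect. 5.1.4. This requires for general link
$g(p) = \eta$ and inverse link $p = g^{-1}(\eta)$, respectively, the following block diagonal
matrix

\[
\begin{aligned}
W(\beta)
&= \mathrm{diag}\left( \left( \left.\nabla_\eta g^{-1}(\eta)\right|_{\eta = X_i\beta} \right)^\top \Sigma_i(\beta)^{-1} \left.\nabla_\eta g^{-1}(\eta)\right|_{\eta = X_i\beta} \right)_{1 \le i \le n} \\
&= \mathrm{diag}\left( \left( \left( \left.\nabla_p g(p)\right|_{p = g^{-1}(X_i\beta)} \right)^{-1} \right)^\top \Sigma_i(\beta)^{-1} \left( \left.\nabla_p g(p)\right|_{p = g^{-1}(X_i\beta)} \right)^{-1} \right)_{1 \le i \le n},
\end{aligned}
\tag{5.73}
\]

and the working residuals

\[
R(Y, \beta) = \left( \left.\nabla_p g(p)\right|_{p = g^{-1}(X_i\beta)} \left( T(Y_i) - p(x_i) \right) \right)^\top_{1 \le i \le n}.
\tag{5.74}
\]

Because we work with the canonical link $g = h$ and $g^{-1} = \nabla_\theta \kappa$, we can use the
simplified block diagonal matrix

\[
W(\beta) = \mathrm{diag}\left( \Sigma_1(\beta), \ldots, \Sigma_n(\beta) \right) \in \mathbb{R}^{kn \times kn},
\]

and the working residuals

\[
R(Y, \beta) = \left( \Sigma_i(\beta)^{-1} \left( T(Y_i) - p(x_i) \right) \right)^\top_{1 \le i \le n} \in \mathbb{R}^{kn}.
\]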

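Finally, we define the design matrix

\[
\mathfrak{X} = \left( X_1^\top, \ldots, X_n^\top \right)^\top \in \mathbb{R}^{kn \times k(q+1)}.
\]

Putting everything together we receive the score equations

\[
s(\beta, Y) = \nabla_\beta \ell_Y(\beta) = \mathfrak{X}^\top W(\beta) R(Y, \beta) = 0.
\tag{5.75}
\]

This is now exactly in the same form as in Proposition 5.1. Fisher's scoring
method/IRLS algorithm then allows us to recursively calculate the MLE of $\beta \in
\mathbb{R}^{k(q+1)}$ by

\[
\hat{\beta}^{(t)} \mapsto \hat{\beta}^{(t+1)} = \left( \mathfrak{X}^\top W(\hat{\beta}^{(t)}) \mathfrak{X} \right)^{-1} \mathfrak{X}^\top W(\hat{\beta}^{(t)}) \left( \mathfrak{X}\hat{\beta}^{(t)} + R(Y, \hat{\beta}^{(t)}) \right).
\]

We have asymptotic normality of the MLE (under suitable regularity conditions)

\[
\hat{\beta}_n^{\rm MLE} \overset{(d)}{\approx} \mathcal{N}\left( \beta, \mathcal{I}_n(\beta)^{-1} \right),
\]

for large sample sizes $n$. This allows us to apply the Wald test (5.32) for backward
parameter elimination. Moreover, in-sample and out-of-sample losses can
be analyzed with unit deviances coming from the categorical cross-entropy loss
function (4.19).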
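To illustrate the Fisher's scoring recursion in the simplest possible categorical setting, the following Python sketch treats the intercept-only model, i.e., $x_i = 1$ and $X_i$ the identity matrix, so that $\beta = \theta \in \mathbb{R}^k$. This special case, the function names, and the use of the closed-form inverse $\Sigma^{-1} = \mathrm{diag}(1/p) + \mathbb{1}\mathbb{1}^\top / p_{k+1}$ (Sherman–Morrison) are our own illustrative assumptions; a general implementation would solve the linear system of the IRLS step numerically.

```python
import math


def softmax_with_reference(theta):
    """Probabilities p_1, ..., p_k of (5.69); level k+1 is the reference."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / denom for t in theta]


def fisher_scoring_intercept_only(counts, iterations=50):
    """MLE of theta from observed counts of the levels 1, ..., k+1.

    Score: n * (freq - p); Fisher information: n * (diag(p) - p p^T).
    Each step adds Sigma^{-1} (freq - p) to the current theta.
    """
    n = float(sum(counts))
    k = len(counts) - 1
    freq = [c / n for c in counts[:k]]
    theta = [0.0] * k
    for _ in range(iterations):
        p = softmax_with_reference(theta)
        p_ref = 1.0 - sum(p)                      # probability of level k+1
        diff = [f - q for f, q in zip(freq, p)]   # score divided by n
        common = sum(diff) / p_ref                # contribution of 1 1^T / p_ref
        theta = [t + d / q + common for t, d, q in zip(theta, diff, p)]
    return theta
```

For strictly positive counts the recursion should converge to $\theta_l = \log(\bar{p}_l / \bar{p}_{k+1})$, with $\bar{p}_l$ the observed relative frequencies, which is the canonical link (5.70) evaluated at the empirical probabilities.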


Remarks 5.32 The above derivations have been done for the categorical distribution
under the canonical link choice. However, these considerations hold true for more
general links $g$ within the vector-valued parameter EF. That is, the block diagonal
matrix $W(\beta)$ in (5.73) and the working residuals $R(Y, \beta)$ in (5.74) provide score
equations (5.75) for general vector-valued parameter EF examples, where we
replace the categorical probability $p$ by the mean $\mu = \mathbb{E}_\theta[T(Y)]$.


5.8 Further Topics of Regression Modeling 


There are several special topics and tools in regression modeling that we have not 
discussed, yet. Some of them will be considered in selected chapters below, and 
some points are mentioned here, without going into detail. 


5.8.1 Longitudinal Data and Random Effects 


The GLMs studied above have been considering cross-sectional data, meaning that
we have fixed one time period $t$ and studied this time period in an isolated fashion.
Time-dependent extensions are called longitudinal or panel data. Consider a time
series of data $(Y_{i,t}, x_{i,t})$ for policies $1 \le i \le n$ and time points $t \ge 1$. For the
prediction of response variable $Y_{i,t}$ we may then regress on the individual past
history of policy $i$, given by the data

\[
\mathcal{D}_{i,t} = \left\{ Y_{i,1}, \ldots, Y_{i,t-1}, x_{i,1}, \ldots, x_{i,t} \right\}.
\]

In particular, we may explore the distribution of $Y_{i,t}$, conditionally given $\mathcal{D}_{i,t}$,

\[
Y_{i,t}|_{\mathcal{D}_{i,t}} \sim F(\cdot | \mathcal{D}_{i,t}; \theta),
\]

for canonical parameter $\theta \in \Theta$ and $F(\cdot | \mathcal{D}_{i,t}; \theta)$ being a member of the EDF. For a
GLM we choose a link function $g$ and make the assumption

\[
g\left( \mathbb{E}_\beta[Y_{i,t} | \mathcal{D}_{i,t}] \right) = \langle \beta, z_{i,t} \rangle,
\tag{5.76}
\]

where $z_{i,t} \in \mathbb{R}^{q+1}$ is a $(q+1)$-dimensional and $\sigma(\mathcal{D}_{i,t})$-measurable feature vector,
and regression parameter $\beta \in \mathbb{R}^{q+1}$ describes the common systematic effects across
all policies $1 \le i \le n$. This gives a generalized auto-regressive model, and if we
have the Markov property

\[
F(\cdot | \mathcal{D}_{i,t}; \theta) = F(\cdot | Y_{i,t-1}, x_{i,t}; \theta) \qquad \text{for all } t \ge 2 \text{ and } \theta \in \Theta,
\]

we obtain a generalized auto-regressive model of order 1. These longitudinal models
allow one to model experience rating, for instance, in car insurance where the
past claims history directly influences the future insurance prices, we refer to
Remark 5.15 on bonus-malus systems (BMS).

The next level of complexity is obtained by extending regression structure (5.76)
by policy $i$ specific random effects $B_i$ such that we may postulate

\[
g\left( \mathbb{E}_\beta[Y_{i,t} | \mathcal{D}_{i,t}, B_i] \right) = \langle \beta, z_{i,t} \rangle + \langle B_i, w_{i,t} \rangle,
\tag{5.77}
\]

with $\sigma(\mathcal{D}_{i,t})$-measurable feature vector $w_{i,t}$. Regression parameter $\beta$ then describes
the fixed systematic effects that are common over the entire portfolio $1 \le i \le n$
and $B_i$ describes the policy dependent random effects (assumed to be normalized,
$\mathbb{E}[B_i] = 0$). Typically one assumes that $B_1, \ldots, B_n$ are centered and i.i.d. Such
effects are called static random effects because they are not time-dependent, and
they may also be interpreted in a Bayesian sense.

Finally, extending these static random effects to dynamic random effects $B_{i,t}$,
$t \ge 1$, leads to so-called state-space models, the linear state-space model being the
most popular example and being fitted using the Kalman filter [207].


5.8.2 Regression Models Beyond the GLM Framework 


There are several ways in which the GLM framework can be modified.


Siblings of Generalized Linear Regression Functions


The most common modification of GLMs concerns the regression structure, namely,
that the scalar product in the linear predictor

\[
x \mapsto g(\mu) = \eta = \langle \beta, x \rangle,
\]

is replaced by another regression function. A popular alternative is the framework
of generalized additive models (GAMs). GAMs go back to Hastie–Tibshirani
[181, 182] and the standard reference is Wood [384]. GAMs consider the regression
functions

\[
x \mapsto g(\mu) = \eta = \beta_0 + \sum_{j=1}^q \beta_j s_j(x_j),
\tag{5.78}
\]

where $s_j : \mathbb{R} \to \mathbb{R}$ are natural cubic splines. Natural cubic splines $s_j$ are obtained
by concatenating cubic functions in so-called nodes. A GAM can have as many
nodes in each cubic spline $s_j$ as there are different levels $x_{i,j}$ in the data $1 \le i \le n$.
In general, this leads to very flexible regression models, and regularization is applied
to control in-sample over-fitting; for regularization we also refer to Sect. 6.2.
Regularization requires setting a tuning parameter, and an efficient determination of
this tuning parameter uses generalized cross-validation, see Sect. 5.6. Nevertheless,
fitting GAMs can be computationally demanding: already for portfolios with 1 million
policies and 20 feature components the calibration can be very slow.
Moreover, regression function (5.78) does not (directly) allow for a data driven
method of finding interactions between feature components. For these reasons, we
do not further study GAMs in this monograph.

A modification in the regression function that is able to consider interactions
between feature components is the framework of classification and regression trees
(CARTs). CARTs have been introduced by Breiman et al. [54] in 1984, and they
are still used in their original form today. Regression trees aim to partition the feature
space $\mathcal{X}$ into a finite number of disjoint subsets $\mathcal{X}_t$, $1 \le t \le T$, such that all policies
$(Y_i, x_i)$ in the same subset $x_i \in \mathcal{X}_t$ satisfy a certain homogeneity property w.r.t. the
regression task (and the chosen loss function). The CART regression function is
then defined by

\[
x \mapsto \hat{\mu}(x) = \sum_{t=1}^T \hat{\mu}_t\, \mathbb{1}_{\{x \in \mathcal{X}_t\}},
\]

where $\hat{\mu}_t$ is the homogeneous mean estimator on $\mathcal{X}_t$. These CARTs are popular
building blocks for ensemble methods where different regression functions are
combined, we mention random forests and boosting algorithms that mainly rely
on CARTs. Random forests have been introduced by Breiman [52], and boosting
has been popularized by Valiant [362], Kearns–Valiant [209, 210], Schapire [328],
Freund [139] and Freund–Schapire [140]. Today boosting belongs to the most
powerful predictive regression methods, we mention the XGBoost algorithm of
Chen–Guestrin [71] that has won many competitions. We will not further study
CARTs and boosting in these notes because these methods also have some
drawbacks. For instance, the resulting regression functions are not continuous, nor do
they easily allow one to extrapolate beyond the (observed) feature space, e.g., if we
have a time component. Moreover, they are more difficult to use with unstructured
data such as text data. For more on CARTs and boosting in actuarial science we
refer to Denuit et al. [100] and Ferrario–Hämmerli [125].


Other Distributional Models 


The theory above has been relying on the EDF, but, of course, we could also study
any other family of distribution functions. A clear drawback of the EDF is that
it only considers light-tailed distribution functions, i.e., distribution functions for
which the moment generating function exists around the origin. If the data is more
heavy-tailed, one may need to transform this data and then use the EDF on the
transformed data (with the drawback that one loses the balance property) or one
chooses another family of distribution functions. Transformations have already been
discussed in Remarks 2.11 and Sect. 5.3.9. Another two families of distributions that
have been studied in the actuarial literature are the generalized beta of the second
kind (GB2) distribution, see Venter [369], Frees et al. [137] and Chan et al. [66], and
inhomogeneous phase type (IHP) distributions, see Albrecher et al. [8] and Bladt
[37]. The GB2 family is a 4-parameter family, and it nests several examples such
as the gamma, the Weibull, the Pareto and the Lomax distributions, see Table B1 in
Chan et al. [66]. The density of the GB2 distribution is for $y > 0$ given by

\[
\begin{aligned}
f(y; a, b, \alpha_1, \alpha_2)
&= \frac{|a|}{b}\, \frac{(y/b)^{a\alpha_1 - 1}}{B(\alpha_1, \alpha_2) \left(1 + (y/b)^a\right)^{\alpha_1 + \alpha_2}} \\
&= \frac{|a|}{B(\alpha_1, \alpha_2)\, y} \left( \frac{(y/b)^a}{1 + (y/b)^a} \right)^{\alpha_1} \left( \frac{1}{1 + (y/b)^a} \right)^{\alpha_2},
\end{aligned}
\tag{5.79}
\]

with scale parameter $b > 0$, shape parameters $a \in \mathbb{R}$ and $\alpha_1, \alpha_2 > 0$, and beta
function

\[
B(\alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}.
\]

Consider a modified logistic transformation of variable $y \mapsto z = (y/b)^a / (1 +
(y/b)^a) \in (0, 1)$. This gives us the beta density

\[
f(z; \alpha_1, \alpha_2) = \frac{z^{\alpha_1 - 1} (1 - z)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)}.
\]

Thus, the GB2 distribution can be obtained by a transformation of the beta
distribution. The latter provides that a GB2 distributed random variable $Y$ can be
simulated from $Y \overset{(d)}{=} b\, (Z/(1 - Z))^{1/a}$ with $Z \sim \mathrm{Beta}(\alpha_1, \alpha_2)$.

A GB2 distributed random variable $Y$ has first moment

\[
\mathbb{E}_{a, b, \alpha_1, \alpha_2}[Y] = \frac{B(\alpha_1 + 1/a, \alpha_2 - 1/a)}{B(\alpha_1, \alpha_2)}\, b,
\]

for $-\alpha_1 a < 1 < \alpha_2 a$. Observe that for $a > 0$ we have that the survival function of
$Y$ is regularly varying with tail index $\alpha_2 a > 0$. Thus, we can model Pareto-like tails
with the GB2 family; for regular variation we refer to (1.3).
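The simulation recipe and the first moment formula can be compared in a small Monte Carlo experiment. The following Python sketch uses only the standard library; the function names and the parameter values in the usage lines are our own illustrative choices (they satisfy $\alpha_2 a > 2$, so that the variance is finite and the empirical mean is stable).

```python
import math
import random


def log_beta(alpha1, alpha2):
    """Logarithm of the beta function B(alpha1, alpha2)."""
    return math.lgamma(alpha1) + math.lgamma(alpha2) - math.lgamma(alpha1 + alpha2)


def gb2_mean(a, b, alpha1, alpha2):
    """First moment of the GB2 distribution; requires -alpha1*a < 1 < alpha2*a."""
    if not (-alpha1 * a < 1.0 < alpha2 * a):
        raise ValueError("first moment does not exist for these parameters")
    return b * math.exp(log_beta(alpha1 + 1.0 / a, alpha2 - 1.0 / a)
                        - log_beta(alpha1, alpha2))


def gb2_sample(a, b, alpha1, alpha2, rng):
    """Simulate Y = b * (Z / (1 - Z))^(1/a) with Z ~ Beta(alpha1, alpha2)."""
    z = rng.betavariate(alpha1, alpha2)
    return b * (z / (1.0 - z)) ** (1.0 / a)


# Usage: compare the empirical mean with the closed-form first moment.
rng = random.Random(1)
a, b, alpha1, alpha2 = 2.0, 1000.0, 2.0, 3.0
samples = [gb2_sample(a, b, alpha1, alpha2, rng) for _ in range(100000)]
empirical_mean = sum(samples) / len(samples)
theoretical_mean = gb2_mean(a, b, alpha1, alpha2)
```

For heavier tails (tail index $\alpha_2 a$ close to 1) the empirical mean converges only slowly, which is exactly the sensitivity of mean estimation under large claims.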

As proposed in Frees et al. [137], one can introduce a regression structure for
$b > 0$ by choosing a log-link and setting

\[
\log\left( \mathbb{E}_{a, b(x), \alpha_1, \alpha_2}[Y] \right)
= \log\left( \frac{B(\alpha_1 + 1/a, \alpha_2 - 1/a)}{B(\alpha_1, \alpha_2)} \right) + \langle \beta, x \rangle.
\]

MLE of $\beta$ may pose some challenge because it depends on the nuisance parameters
$a, \alpha_1, \alpha_2$. In a recent paper, Li et al. [251] propose to extend this GB2
regression to a composite regression model; composite models are discussed in
Sect. 6.4.4, below. This closes this short section, and for more examples we refer
to the literature.


5.8.3 Quantile Regression 


Pinball Loss Function 


The GLMs introduced above aim at estimating the means $\mu(x) = \mathbb{E}_{\theta(x)}[Y]$ of
random variables $Y$ being explained by features $x$. Since mean estimation can
be rather sensitive in situations where we have large claims, the more robust
quantile regression has attracted some attention recently. Quantile regression has
been introduced by Koenker–Bassett [220]. The idea is that instead of estimating
the mean $\mu$ of a random variable $Y$, we rather try to estimate its $\tau$-quantile for
given $\tau \in (0, 1)$. The $\tau$-quantile is given by the generalized inverse $F^{-1}(\tau)$ of the
distribution function $F$ of $Y$, that is,

\[
F^{-1}(\tau) = \inf\{ y \in \mathbb{R};\ F(y) \ge \tau \}.
\tag{5.80}
\]

Consider the pinball loss function for $y \in \mathcal{C}$ (convex closure of the support of $Y$)
and actions $a \in \mathbb{A} = \mathbb{R}$

\[
(y, a) \mapsto L_\tau(y, a) = (y - a)\left( \tau - \mathbb{1}_{\{y - a < 0\}} \right) \ge 0.
\tag{5.81}
\]

This provides us with the expected loss for $Y \sim F$ and action $a \in \mathbb{A}$

\[
\begin{aligned}
\mathbb{E}_F[L_\tau(Y, a)]
&= \mathbb{E}_F\left[ (Y - a)\left( \tau - \mathbb{1}_{\{Y < a\}} \right) \right] \\
&= (\tau - 1)\, \mathbb{E}_F\left[ (Y - a) \mathbb{1}_{\{Y < a\}} \right] + \tau\, \mathbb{E}_F\left[ (Y - a) \mathbb{1}_{\{Y \ge a\}} \right] \\
&= (\tau - 1) \int_{(-\infty, a)} (y - a)\, dF(y) + \tau \int_{[a, \infty)} (y - a)\, dF(y).
\end{aligned}
\]

The aim is to find an optimal action $\hat{a}(F)$ that minimizes this expected loss,
see (4.24),

\[
\hat{a}(F) \in \mathfrak{A}(F) = \underset{a \in \mathbb{A}}{\arg\min}\ \mathbb{E}_F[L_\tau(Y, a)].
\]

Note that for the time being we do not know whether the solution to this
minimization problem is a singleton. For this reason, we state the solution (subject
to existence) as a set-valued functional $\mathfrak{A}$, see (4.25).

We calculate the score equation of the expected loss using the Leibniz rule

\[
\begin{aligned}
\frac{\partial}{\partial a} \mathbb{E}_F[L_\tau(Y, a)]
&= -(\tau - 1) \int_{(-\infty, a)} dF(y) - \tau \int_{[a, \infty)} dF(y) \\
&= -(\tau - 1) F(a) - \tau \left( 1 - F(a) \right) = F(a) - \tau \overset{!}{=} 0.
\end{aligned}
\]

Assume the distribution $F$ is continuous. This implies $F(F^{-1}(\tau)) = \tau$, and we have

\[
F^{-1}(\tau) \in \mathfrak{A}(F) = \underset{a \in \mathbb{A}}{\arg\min}\ \mathbb{E}_F[L_\tau(Y, a)].
\]

In fact, using the pinball loss, we have just seen that the $\tau$-quantile is elicitable
within the class of continuous distributions, see Definition 4.18.
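The minimization property of the pinball loss can be illustrated on an empirical distribution. In the following Python sketch (illustrative names, standard library only) we minimize the empirical pinball loss over the observed values and compare the minimizer with the generalized inverse (5.80) of the empirical distribution function. We restrict the search to the sample points because the empirical loss is piecewise linear in $a$ with kinks only at the observations.

```python
def pinball_loss(y, a, tau):
    """Pinball loss (5.81) for observation y, action a and level tau."""
    return (y - a) * (tau - (1.0 if y - a < 0.0 else 0.0))


def empirical_quantile(sample, tau):
    """Generalized inverse (5.80) of the empirical distribution function."""
    ordered = sorted(sample)
    n = len(ordered)
    for j, value in enumerate(ordered, start=1):
        if j / n >= tau:          # F_n(value) >= j/n >= tau for the first time
            return value
    return ordered[-1]


def pinball_minimizer(sample, tau):
    """Sample point minimizing the empirical pinball loss."""
    candidates = sorted(set(sample))
    return min(candidates,
               key=lambda a: sum(pinball_loss(y, a, tau) for y in sample))
```

If $\tau n$ is an integer, the empirical loss is flat between two neighboring observations and the minimizer is not unique; this is the situation that motivates the set-valued formulation of the optimal action.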

For a more general result we need a more general definition of a (set-valued)
$\tau$-quantile

\[
Q_\tau(F) = \left\{ y \in \mathbb{R};\ \lim_{z \uparrow y} F(z) \le \tau \le F(y) \right\}.
\tag{5.82}
\]

This defines a closed interval and its lower endpoint corresponds to the generalized
inverse $F^{-1}(\tau)$ given in (5.80). In complete analogy to Theorem 4.19 on the
elicitability of the mean functional, we have the following statement for the
$\tau$-quantile; this result goes back to Thomson [351] and Saerens [326].


Theorem 5.33 (Gneiting [162, Theorem 9], Without Proof) Let $\mathcal{F}$ be the class of
distribution functions on an interval $\mathcal{C} \subseteq \mathbb{R}$ and choose quantile level $\tau \in (0, 1)$.

• The $\tau$-quantile (5.82) is elicitable relative to $\mathcal{F}$.

• Assume the loss function $L : \mathcal{C} \times \mathbb{A} \to \mathbb{R}_+$ satisfies (L0)–(L2) on page 92 for
interval $\mathcal{C} = \mathbb{A} \subseteq \mathbb{R}$. $L$ is consistent for the $\tau$-quantile (5.82) relative to the class
$\mathcal{F}$ of compactly supported distributions on $\mathcal{C}$ if and only if $L$ is of the form

\[
L(y, a) = \left( G(y) - G(a) \right) \left( \tau - \mathbb{1}_{\{y - a < 0\}} \right),
\]

for a non-decreasing function $G$ on $\mathcal{C}$.

• If $G$ is strictly increasing on $\mathcal{C}$ and if $\mathbb{E}_F[G(Y)]$ exists and is finite for all $F \in
\mathcal{F}$, then the above loss function $L$ is strictly consistent for the $\tau$-quantile (5.82)
relative to the class $\mathcal{F}$.


Theorem 5.33 characterizes the strictly consistent loss functions for quantile
estimation, the pinball loss being the special case $G(y) = y$.


Quantile Regression 


The idea behind quantile regression is that we build a regression model for the
$\tau$-quantile. Assume we have a datum $(Y, x)$ whose conditional $\tau$-quantile, given $x \in
\{1\} \times \mathbb{R}^q$, can be described by the regression function

\[
x \mapsto g\left( F^{-1}_{Y|x}(\tau) \right) = \langle \beta_\tau, x \rangle,
\]

for a strictly monotone and smooth link function $g : \mathcal{C} \to \mathbb{R}$, and for a regression
parameter $\beta_\tau \in \mathbb{R}^{q+1}$. The aim now is to estimate this regression parameter from
independent data $(Y_i, x_i)$, $1 \le i \le n$. The pinball loss $L_\tau$, given in (5.81), provides
us with the following optimization problem

\[
\hat{\beta}_\tau = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\min}\ \sum_{i=1}^n L_\tau\left( Y_i, g^{-1}\langle \beta, x_i \rangle \right).
\]

This then allows us to estimate the corresponding $\tau$-quantile as a function of the
feature information $x$. For $\tau = 1/2$ we estimate the median by

\[
\hat{F}^{-1}_{Y|x}(1/2) = g^{-1}\langle \hat{\beta}_{1/2}, x \rangle.
\]
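For the identity link $g(y) = y$ the above optimization problem is a linear program, and an optimal solution can be found among the regression functions that interpolate $q + 1$ observations. The following Python sketch exploits this for one feature plus intercept by searching over all lines through two data points. This brute-force search, the restriction to the identity link, and the function names are our own illustrative assumptions; for realistic sample sizes one would use a dedicated linear programming or quantile regression routine.

```python
from itertools import combinations


def pinball_loss(y, a, tau):
    """Pinball loss (5.81) for observation y, action a and level tau."""
    return (y - a) * (tau - (1.0 if y - a < 0.0 else 0.0))


def quantile_regression_line(x, y, tau):
    """Intercept, slope and total pinball loss of the best interpolating line.

    Searches all lines through two observations with distinct features,
    which costs O(n^3) operations and is only meant for small examples.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # no finite slope through these two points
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(pinball_loss(yk, intercept + slope * xk, tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[2]:
            best = (intercept, slope, loss)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best
```

In the usage below a single large observation does not move the fitted median line, which illustrates the robustness of quantile regression compared to mean estimation:

```python
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 100.0]   # last observation is a large claim
intercept, slope, loss = quantile_regression_line(x, y, 0.5)
```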


We conclude from this short section that we can regress any quantity $a(F)$ that is
elicitable, i.e., for which a loss function exists that is strictly consistent for $a(F)$
on $F \in \mathcal{F}$. For more on quantile regression we refer to the monograph of Uribe–Guillén
[361], and an interesting paper is Dimitriades et al. [106]. We will study
quantile regression within deep networks in Sect. 11.2, below.


Chapter 6
Bayesian Methods, Regularization
and Expectation-Maximization


The previous chapter has been focusing on MLE of regression parameters within 
GLMs. Alternatively, we could address the parameter estimation problem within a 
Bayesian setting. The purpose of this chapter is to discuss the Bayesian estimation 
approach. This leads us to the notion of regularization within GLMs. Bayesian 
methods are also used in the Expectation-Maximization (EM) algorithm for MLE 
in the case of incomplete data. For literature on Bayesian theory we recommend 
Gelman et al. [157], Congdon [79], Robert [319], Bühlmann-Gisler [58] and Gilks 
et al. [158]. A nice historical (non-mathematical) review of Bayesian methods is 
presented in McGrayne [266]. Regularization is discussed in the book of Hastie et 
al. [184], and a good reference for the EM algorithm is McLachlan—Krishnan [267]. 


6.1 Bayesian Parameter Estimation 


The Bayesian estimator has been introduced in Definition 3.6. Assume that the
observation $Y$ has independent components $Y_i$ that can be described by a GLM
with link function $g$ and regression parameter $\beta \in \mathbb{R}^{q+1}$, i.e., the
random variables $Y_i$ have densities

$$Y_i \overset{\text{ind.}}{\sim} f(y_i; \beta, x_i, v_i/\varphi) = \exp\left\{ \frac{y_i\,(h \circ g^{-1})\langle \beta, x_i\rangle - (\kappa \circ h \circ g^{-1})\langle \beta, x_i\rangle}{\varphi/v_i} + a(y_i; v_i/\varphi) \right\},$$

with canonical link $h = (\kappa')^{-1}$. In a Bayesian approach one models the regression
parameter $\beta$ with a prior distribution¹ $\pi(\beta)$ on the parameter space
$\mathbb{R}^{q+1}$, and the independence assumption between the components of $Y$ needs to
be understood conditionally, given the regression parameter $\beta$. In other words, all
observations $Y_i$ share the same regression parameter $\beta$, which itself is modeled by
a prior distribution $\pi$.

¹ Often, in Bayesian arguments, the terms distribution and density are used interchangeably
(and not fully precisely), and it is left to the reader to give the right meaning to $\pi$.

The joint density of $Y$ and $\beta$ is given by

$$p(y, \beta) = \left( \prod_{i=1}^n f(y_i; \beta, x_i, v_i/\varphi) \right) \pi(\beta) = \exp\left\{ \ell_{Y=y}(\beta) + \log \pi(\beta) \right\}. \tag{6.1}$$

For the given observation $Y$, this allows us to calculate the posterior density of $\beta$
using Bayes' rule

$$\pi(\beta|Y) = \frac{p(Y, \beta)}{\int p(Y, \tilde\beta)\, d\tilde\beta} \propto \left( \prod_{i=1}^n f(Y_i; \beta, x_i, v_i/\varphi) \right) \pi(\beta), \tag{6.2}$$

where the proportionality sign $\propto$ indicates that we have dropped the terms that do
not depend on $\beta$. Thus, the functional form in $\beta$ of the posterior density
$\pi(\beta|Y)$ is fully determined by the joint density $p(Y, \beta)$, and the remaining
term is a normalization to obtain a proper probability distribution. In many situations,
the knowledge of the functional form of the posterior density in $\beta$ is sufficient to
perform Bayesian parameter estimation, at least numerically. We will give some
references, below.

The Bayesian estimator for $\beta$ is given by the posterior mean (provided it exists)

$$\widehat{\beta}^{\rm Bayes} = \mathbb{E}_\pi[\beta \,|\, Y] = \int \beta\, \pi(\beta|Y)\, d\beta.$$


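If we want to calculate the expectation of a new random variable $Y_{n+1}$ that is
conditionally, given $\beta$, independent of $Y$ and follows the same GLM as $Y$, we can
directly calculate, using the tower property and conditional independence,²

$$\begin{aligned}
\mathbb{E}_\pi[Y_{n+1} | Y] &= \mathbb{E}_\pi\big[ \mathbb{E}[Y_{n+1} | \beta, Y] \,\big|\, Y \big] = \mathbb{E}_\pi\big[ \mathbb{E}[Y_{n+1} | \beta] \,\big|\, Y \big] \\
&= \mathbb{E}_\pi\big[ g^{-1}\langle \beta, x_{n+1}\rangle \,\big|\, Y \big] = \int g^{-1}\langle \beta, x_{n+1}\rangle\, \pi(\beta|Y)\, d\beta,
\end{aligned}$$

provided that this first moment exists and that $x_{n+1}$ is the feature of $Y_{n+1}$. We
see that it all boils down to having sufficiently explicit knowledge about the posterior
density $\pi(\beta|Y)$ given in (6.2).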


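² Note that we identify probabilities $\mathbb{P}_\beta[\cdot] = \mathbb{P}[\cdot|\beta]$ for given $\beta$.


Remark 6.1 (Conditional MSEP) Based on the assumption that the posterior distribution
$\pi(\beta|Y)$ can be determined, we can analyze the GL. In a Bayesian setup one
usually does not calculate the MSEP as described in Theorem 4.1, but one rather
studies the conditional MSEP, conditioned exactly on the collected information $Y$.
That is,

$$\begin{aligned}
\mathbb{E}_\pi\Big[ \big( Y_{n+1} - \mathbb{E}_\pi[Y_{n+1}|Y] \big)^2 \,\Big|\, Y \Big] &= \operatorname{Var}_\pi(Y_{n+1} | Y) \\
&= \operatorname{Var}_\pi\big( \mathbb{E}[Y_{n+1}|\beta, Y] \,\big|\, Y \big) + \mathbb{E}_\pi\big[ \operatorname{Var}(Y_{n+1}|\beta, Y) \,\big|\, Y \big] \\
&= \operatorname{Var}_\pi\big( g^{-1}\langle \beta, x_{n+1}\rangle \,\big|\, Y \big) + \mathbb{E}_\pi\Big[ \frac{\varphi}{v_{n+1}} (\kappa'' \circ h \circ g^{-1})\langle \beta, x_{n+1}\rangle \,\Big|\, Y \Big] \\
&= \operatorname{Var}_\pi\big( g^{-1}\langle \beta, x_{n+1}\rangle \,\big|\, Y \big) + \frac{\varphi}{v_{n+1}}\, \mathbb{E}_\pi\big[ V\big(g^{-1}\langle \beta, x_{n+1}\rangle\big) \,\big|\, Y \big],
\end{aligned}$$

where we need to assume existence of second moments. Similar to Theorem 4.1,
the first term is the estimation variance (in a Bayesian setting) and the second term
is the average process variance (using the EDF variance function $\mu \mapsto V(\mu)$).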


The remaining difficulty is the calculation of the posterior expectation of functions
of $\beta$, based on posterior density (6.2). In very well-designed experiments the
posterior density $\pi(\beta|Y)$ can be determined explicitly, for instance, in the
homogeneous EDF case with so-called conjugate priors, see Chapter 2 in Bühlmann–Gisler
[58]. But in most cases, there is no closed-form solution for the posterior distribution.
Major progress in Bayesian modeling has been made with the emergence of
computational methods like the Markov chain Monte Carlo (MCMC) method, Gibbs
sampling, the Metropolis–Hastings (MH) algorithm [185, 274], sequential Monte
Carlo (SMC) sampling, non-linear particle filters, and the Hamiltonian Monte Carlo
(HMC) algorithm. These methods help us to empirically approximate the posterior
density $\pi(\beta|Y)$ in different modeling setups. They have in common that
explicit knowledge of the normalizing constant in (6.2) is not necessary; it
suffices to know the functional form in $\beta$ of the posterior density $\pi(\beta|Y)$.
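
To illustrate that the functional form of (6.2) is sufficient, the following minimal
sketch evaluates a posterior expectation numerically in a one-parameter example. The
setup is an assumption made only for illustration and is not taken from the text: a
homogeneous Poisson model with log-link and means $v_i e^{\beta}$, a prior under which
$e^{\beta}$ is gamma distributed with hypothetical parameters `a` and `b`, and a simple
midpoint rule on a grid instead of MCMC. In this conjugate case the posterior mean of
$e^{\beta}$ is known in closed form, which gives a check of the numerical result.

```python
import math

def log_unnormalized_posterior(beta, y, v, a, b):
    """Functional form of (6.2) in beta, up to an additive constant."""
    # Poisson log-likelihood with means v_i * exp(beta); beta-free terms dropped
    loglik = sum(yi * beta - vi * math.exp(beta) for yi, vi in zip(y, v))
    # log-prior of beta when exp(beta) follows a gamma(a, b) distribution
    logprior = a * beta - b * math.exp(beta)
    return loglik + logprior

def posterior_expectation(func, y, v, a, b, lo=-12.0, hi=6.0, steps=20000):
    """Approximate E[func(beta) | Y] by a midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    grid = [lo + (k + 0.5) * h for k in range(steps)]
    logp = [log_unnormalized_posterior(bk, y, v, a, b) for bk in grid]
    shift = max(logp)  # stabilizes the exponentials
    w = [math.exp(lp - shift) for lp in logp]
    # the unknown normalizing constant cancels in this ratio
    return sum(func(bk) * wk for bk, wk in zip(grid, w)) / sum(w)

# toy data (hypothetical): claim counts y with exposures v
y = [0, 1, 0, 2, 1]
v = [1.0, 0.5, 1.0, 2.0, 1.0]
a, b = 2.0, 4.0

# posterior mean of exp(beta), i.e. the predictor for a new unit-exposure policy
predicted_mean = posterior_expectation(math.exp, y, v, a, b)
# closed-form posterior mean in the conjugate Poisson-gamma case
conjugate_mean = (a + sum(y)) / (b + sum(v))
```

The grid approach only works in very low dimensions; it is shown here because it makes
the cancellation of the normalizing constant explicit.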

For a detailed description of MCMC methods in general, which includes Gibbs
sampling and MH algorithms, we refer to Gilks et al. [158], Green [169, 170],
Johansen et al. [199]; SMC sampling and non-linear particle filters are explained
in Del Moral et al. [92, 93], Johansen–Evers [199], Doucet–Johansen [111], Creal
[85] and Wüthrich [389]; the HMC algorithm is described in Neal [281]. We do not
present these algorithms here, but for the description of the most popular algorithms
we refer to Section 4.4 in Wüthrich–Buser [392]. The reason for not presenting
these algorithms here is that they still face the curse of dimensionality, which makes
it difficult to use Bayesian methods for high-dimensional data sets in large models;
we provide another short discussion in Sect. 11.6.3, below.


6.2 Regularization


6.2.1 Maximal a Posterior Estimator


In the previous section we have proposed to approximate the posterior density
$\pi(\beta|Y)$ of the regression parameter $\beta$, given $Y$, using MCMC methods. The
posterior log-likelihood in the Bayesian GLM is given by, see (6.2),

$$\log \pi(\beta|Y) \propto \ell_Y(\beta) + \log \pi(\beta) \propto \sum_{i=1}^n \frac{Y_i\,(h \circ g^{-1})\langle \beta, x_i\rangle - (\kappa \circ h \circ g^{-1})\langle \beta, x_i\rangle}{\varphi/v_i} + \log \pi(\beta),$$

where $\propto$ is understood on the log-scale, i.e., up to additive terms that do not
depend on $\beta$. Compared to the classical log-likelihood function $\ell_Y(\beta)$ for
MLE, there is an additional log-density term $\log \pi(\beta)$ that comes from the prior
distribution of $\beta$. Thus, the posterior log-likelihood is a balanced version of the
log-likelihood $\ell_Y(\beta)$ of the data $Y$ and the prior log-density $\log \pi(\beta)$
of the regression parameter $\beta$. We interpret this as regularization because the prior
$\pi$ smooths extremes in the log-likelihood of the observation $Y$. This motivates
estimating the regression parameter $\beta$ by the so-called maximal a posterior (MAP)
estimator

$$\widehat{\beta}^{\rm MAP} = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \log \pi(\beta|Y) = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \ell_Y(\beta) + \log \pi(\beta). \tag{6.3}$$


This $\pi$-regularized (MAP) parameter estimation has gained much popularity
because, under suitable prior choices, it is a useful tool to prevent the model from
over-fitting. Moreover, under specific choices, it allows for parameter selection.
This is especially useful in high-dimensional problems; see
Hastie et al. [184].

Popular choices for $\pi$ are prior densities coming from $L^p$-norms for some $p \ge 1$,
that is, $\pi(\beta) \propto \exp\{-\lambda \|\beta\|_p^p\}$ for $\lambda > 0$. Optimization
problem (6.3) then becomes

$$\widehat{\beta}^{\rm MAP} = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \ell_Y(\beta) - \lambda \|\beta\|_p^p,$$

for a fixed regularization parameter $\lambda > 0$ (also called tuning parameter). In
practical applications we should exclude the intercept parameter $\beta_0 \in \mathbb{R}$
from regularization: if we work with the canonical link within the GLM framework
we have the balance property which implies unbiasedness, see Corollary 5.7. This
property gets lost if $\beta_0$ is included in the regularization term. For this reason,
we set $\beta_- = (\beta_1, \ldots, \beta_q)^\top \in \mathbb{R}^q$ and we let
regularization only act on these components

$$\widehat{\beta}^{\rm MAP} = \widehat{\beta}^{\rm MAP}(\lambda) = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) - \lambda \|\beta_-\|_p^p, \tag{6.4}$$

where we also scale with the sample size $n$ to make the units of the tuning parameter
$\lambda$ independent of the sample size $n$.


Remarks 6.2


• The regularization term $\lambda \|\beta_-\|_p^p$ keeps the components of the regression
parameter $\beta_-$ close to zero; thus, it prevents over-fitting by letting parameters
only take moderate values. The magnitudes of the parameter values are controlled by
the regularization parameter $\lambda > 0$, which acts as a hyper-parameter. Optimal
hyper-parameters are determined by cross-validation.

• In (6.4) all components of $\beta_-$ are treated equally. This may not be appropriate
if the feature components of $x$ live on different scales. This problem of different
scales can be solved by either scaling the components of $x$ to a unit scale, or
by introducing a diagonal importance matrix $T = \operatorname{diag}(t_1, \ldots, t_q)$
with $t_j > 0$ that describes the scales of the components of $x$. This allows us to
regularize $\|T\beta_-\|_p^p$ instead of $\|\beta_-\|_p^p$. Thus, in this latter case we
replace (6.4) by the weighted version

$$\widehat{\beta}^{\rm MAP} = \underset{\beta}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) - \lambda \sum_{j=1}^q t_j^p\, |\beta_j|^p.$$

• Often, the features have a natural group structure $x = (x_0, x_1, \ldots, x_K)$, for
instance, $x_k \in \{0, 1\}^{q_k}$ may represent dummy coding of a categorical feature
component with $q_k + 1$ levels. In that case regularization should equally act on
all components of $\beta_k \in \mathbb{R}^{q_k}$ (that correspond to $x_k$) because these
components describe the same systematic effect. Yuan–Lin [398] proposed for this problem
grouped penalties of the form

$$\widehat{\beta}^{\rm group} = \underset{\beta}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) - \lambda \sum_{k=1}^K \|\beta_k\|_2. \tag{6.5}$$

This proposal leads to sparsity, i.e., for large regularization parameters $\lambda$ the
entire $\beta_k$ may be shrunk (exactly) to zero; this is discussed in Sect. 6.2.5, below.
We also refer to Section 4.3 in Hastie et al. [184], and Devriendt et al. [104]
proposed this approach in the actuarial literature.

• There are more versions of regularization, e.g., in the fused LASSO approach we
ensure that the first differences $\beta_j - \beta_{j-1}$ remain small.
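
The penalty terms of these remarks can be written down directly. The sketch below is
only an illustration; the function names and the toy numbers are hypothetical and not
part of the text. It evaluates the $L^p$-penalty of (6.4), its weighted version, and
the grouped penalty of (6.5), which uses the unsquared Euclidean norm per group.

```python
import math

def lp_penalty(beta_minus, lam, p):
    """lam * ||beta_-||_p^p as in (6.4); the intercept is not part of beta_minus."""
    return lam * sum(abs(b) ** p for b in beta_minus)

def weighted_lp_penalty(beta_minus, lam, p, t):
    """lam * sum_j t_j^p |beta_j|^p with importance weights t_j > 0."""
    return lam * sum((tj * abs(b)) ** p for tj, b in zip(t, beta_minus))

def group_penalty(groups, lam):
    """lam * sum_k ||beta_k||_2 as in (6.5); groups is a list of coefficient blocks."""
    return lam * sum(math.sqrt(sum(b * b for b in g)) for g in groups)

lasso_value = lp_penalty([3.0, -4.0], lam=0.5, p=1)                  # 0.5 * (3 + 4)
ridge_value = lp_penalty([3.0, -4.0], lam=0.5, p=2)                  # 0.5 * (9 + 16)
weighted_value = weighted_lp_penalty([3.0, -4.0], 0.5, 2, [1.0, 0.5])
group_value = group_penalty([[3.0, -4.0], [0.0, 0.0]], lam=0.5)      # 0.5 * (5 + 0)
```

Because the group norm is not squared, it is non-differentiable where an entire block
vanishes, which is what allows whole blocks to be set to zero.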
Our motivation for considering regularization has been inspired by Bayesian
theory, but we can also come from a completely different angle, namely, we can
consider a constrained optimization problem with a given budget constraint $c > 0$.
That is, we can consider

$$\underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) \qquad \text{subject to } \|\beta_-\|_p^p \le c. \tag{6.6}$$

This optimization problem can be tackled by the method of Karush, Kuhn and
Tucker (KKT) [208, 228]. Optimization problem (6.4) corresponds by Lagrangian
duality to the constrained optimization problem (6.6). For every $c$ for which the
budget constraint in (6.6) is binding, $\|\widehat{\beta}_-\|_p^p = c$, there is a
corresponding regularization parameter $\lambda = \lambda(c)$, and, conversely, the
solution of (6.4) solves (6.6) with $c = \|\widehat{\beta}^{\rm MAP}_-(\lambda)\|_p^p$.
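
The correspondence between (6.4) and (6.6) can be sketched through the Lagrangian. The
following lines use standard KKT notation; the symbol $\mathcal{L}$ and the starred
quantities are introduced here only for illustration, and we assume a concave
log-likelihood and $p \ge 1$ so that the problem is convex.

```latex
% Lagrangian of (6.6) with multiplier \lambda \ge 0
\mathcal{L}(\beta, \lambda) = \frac{1}{n}\,\ell_Y(\beta) - \lambda \left( \|\beta_-\|_p^p - c \right).

% For fixed \lambda the term \lambda c is constant, so maximizing
% \mathcal{L}(\cdot, \lambda) over \beta is exactly problem (6.4).

% KKT conditions for a solution (\beta^*, \lambda^*):
0 \in \partial_\beta \left\{ -\frac{1}{n}\,\ell_Y(\beta^*) + \lambda^* \|\beta^*_-\|_p^p \right\}
    \quad \text{(stationarity)},
\|\beta^*_-\|_p^p \le c, \qquad \lambda^* \ge 0
    \quad \text{(primal and dual feasibility)},
\lambda^* \left( \|\beta^*_-\|_p^p - c \right) = 0
    \quad \text{(complementary slackness)}.
```

Complementary slackness reflects the statement above: a strictly positive multiplier
requires a binding budget constraint, and a non-binding constraint gives
$\lambda^* = 0$, i.e., the unregularized MLE.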


6.2.2 Ridge vs. LASSO Regularization 


We compare the two special cases of p = 1, 2 in this section, and in the subsequent 
Sects. 6.2.3 and 6.2.4 we discuss how these two cases can be solved numerically. 


Ridge Regularization p = 2 For $p = 2$, the prior distribution $\pi$ in (6.4) is a
centered Gaussian distribution. This $L^2$-regularization is called ridge regularization
or Tikhonov regularization [353], and we have

$$\widehat{\beta}^{\rm ridge} = \widehat{\beta}^{\rm ridge}(\lambda) = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) - \lambda \sum_{j=1}^q \beta_j^2. \tag{6.7}$$


LASSO Regularization p = 1 For $p = 1$, the prior distribution $\pi$ in (6.4) is a
Laplace distribution. This $L^1$-regularization is called LASSO regularization (least
absolute shrinkage and selection operator), see Tibshirani [352], and we have

$$\widehat{\beta}^{\rm LASSO} = \widehat{\beta}^{\rm LASSO}(\lambda) = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) - \lambda \sum_{j=1}^q |\beta_j|. \tag{6.8}$$

LASSO regularization has the advantage that it shrinks (unimportant) regression
components to exactly zero, i.e., LASSO regularization can be used for parameter
elimination and model reduction. This is discussed in the next paragraphs.
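
The prior interpretation of (6.7) and (6.8) can be made explicit in a short worked
calculation. It assumes i.i.d. prior components for $\beta_-$ with hypothetical scale
parameters $\tau > 0$ (Gaussian) and $b > 0$ (Laplace), and an improper flat prior for
the intercept $\beta_0$; these symbols are introduced only for this illustration.

```latex
% Gaussian prior on \beta_-: centered, variance \tau^2 per component
\pi(\beta_-) \propto \exp\left\{ -\frac{1}{2\tau^2} \sum_{j=1}^q \beta_j^2 \right\}
\quad \Longrightarrow \quad
\frac{1}{n} \Big( \ell_Y(\beta) + \log \pi(\beta_-) \Big)
  = \frac{1}{n}\,\ell_Y(\beta) - \frac{1}{2 n \tau^2}\, \|\beta_-\|_2^2 + \text{const},
% which is (6.7) with \lambda = 1 / (2 n \tau^2).

% Laplace prior on \beta_-: centered, scale b per component
\pi(\beta_-) \propto \exp\left\{ -\frac{1}{b} \sum_{j=1}^q |\beta_j| \right\}
\quad \Longrightarrow \quad
\frac{1}{n} \Big( \ell_Y(\beta) + \log \pi(\beta_-) \Big)
  = \frac{1}{n}\,\ell_Y(\beta) - \frac{1}{n b}\, \|\beta_-\|_1 + \text{const},
% which is (6.8) with \lambda = 1 / (n b).
```

In both cases a more concentrated prior (small $\tau$ or small $b$) corresponds to a
larger regularization parameter $\lambda$.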


Ridge vs. LASSO Regularization Ridge ($p = 2$) and LASSO ($p = 1$)
regularization behave rather differently. This can be understood best by using the
budget constraint (6.6) interpretation which gives us a nice geometric illustration.
The crucial part is that the side constraint gives us either a budget constraint
$\|\beta_-\|_2^2 = \sum_{j=1}^q \beta_j^2 \le c$ (squared Euclidean norm) or
$\|\beta_-\|_1 = \sum_{j=1}^q |\beta_j| \le c$ (Manhattan norm). In Fig. 6.1 we illustrate
these two cases: the left-hand side shows the Euclidean ball in blue color (in two
dimensions) and the right-hand side shows the corresponding Manhattan square in blue
color; this figure is similar to Figure 2.2 in Hastie et al. [184].

The (unconstrained) MLE $\widehat{\beta}^{\rm MLE}$ is illustrated by the red dot in
Fig. 6.1. If the red dot lay within the blue area, the budget constraint would not be
binding. In Fig. 6.1 the red dot (MLE) does not lie within the blue budget constraint,
and we need to compromise on the optimality of the MLE. Assume that the log-likelihood
$\beta \mapsto \ell_Y(\beta)$ is a concave function in $\beta$, then we receive convex
level sets $\{\beta;\ \ell_Y(\beta) \ge \gamma_0\}$ around the MLE
$\widehat{\beta}^{\rm MLE}$. The critical constant $\gamma_0$ for which this level
set is tangential to the blue budget constraint exactly gives us the solution to (6.6);
this solution corresponds to the yellow dots in Fig. 6.1. The crucial difference
between ridge and LASSO regularization is that in the latter case the yellow dot
will eventually be in the corner of the Manhattan square if we shrink the budget
constraint $c$ to zero. Or in other words, some of the components of
$\widehat{\beta}^{\rm LASSO}$ are set exactly equal to zero for small $c$ or large
$\lambda$, respectively; in Fig. 6.1 (rhs) this happens to the first component of
$\widehat{\beta}^{\rm LASSO}$ (under the given budget constraint $c$). In
ridge regularization this is not the case, except for special situations concerning the
position of the red MLE. Thus, ridge regression makes components of parameter
estimates generally smaller, whereas LASSO shrinks some of these components
exactly to zero (this also explains the name LASSO).

Fig. 6.1 Illustration of optimization problem (6.6) under a budget constraint (lhs) for
$p = 2$ (Euclidean norm) and (rhs) $p = 1$ (Manhattan norm). Panel titles: ridge
regularization (L2), LASSO regularization (L1); legend: MLE, ridge regularized, LASSO
regularized

Fig. 6.2 Elastic net regularization


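Remark 6.3 (Elastic Net) LASSO regularization faces difficulties with collinearity
in feature components. In particular, if we have a group of highly correlated feature
components, LASSO fails to do a grouped selection, but it selects one component
and ignores the other ones. On the other hand, ridge regularization can deal with
this issue. For this reason, Zou–Hastie [409] proposed the elastic net regularization,
which uses a combined regularization term

$$\widehat{\beta}^{\rm elastic\ net} = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\max}\ \frac{1}{n} \ell_Y(\beta) - \lambda \left[ (1 - \alpha) \|\beta_-\|_2^2 + \alpha \|\beta_-\|_1 \right],$$

for some $\alpha \in (0, 1)$. The $L^1$-term gives sparsity and the quadratic term removes
the limitation on the number of selected variables, providing a grouped selection.
In Fig. 6.2 we compare the elastic net regularization (orange color) to ridge and
LASSO regularization (black and blue color). Ridge regularization provides a
smooth strictly convex boundary (black), whereas LASSO provides a boundary that
is non-differentiable in the corners (blue). The elastic net is still non-differentiable
in the corners, which is needed for variable selection, and at the same time it is
strictly convex between the corners, which is needed for grouping.


6.2.3 Ridge Regression


In this section we consider ridge regression ($p = 2$) in more detail and we provide an
example. The ridge estimator $\widehat{\beta}^{\rm ridge}$ in (6.7) is found by solving
the score equations

$$s(\beta, Y) = \nabla_\beta \left( \ell_Y(\beta) - n\lambda \|\beta_-\|_2^2 \right) = \mathfrak{X}^\top W(\beta) R(Y, \beta) - 2n\lambda \beta_- = 0, \tag{6.9}$$

note that we exclude the intercept $\beta_0$ from regularization (we use a slight abuse of
notation here: in (6.9), $\beta_-$ is understood as $(0, \beta_1, \ldots, \beta_q)^\top \in
\mathbb{R}^{q+1}$), and we also refer to Proposition 5.1. The negative expected Hessian
of this optimization problem is given by

$$\mathcal{J}(\beta) = -\mathbb{E}_\beta\left[ \nabla_\beta^2 \left( \ell_Y(\beta) - n\lambda \|\beta_-\|_2^2 \right) \right] = \mathcal{I}(\beta) + 2n\lambda\, \operatorname{diag}(0, 1, \ldots, 1) \in \mathbb{R}^{(q+1) \times (q+1)},$$

where $\mathcal{I}(\beta) = \mathfrak{X}^\top W(\beta) \mathfrak{X}$ is Fisher's information
matrix of the unconstrained MLE problem. This provides us with Fisher's scoring updates
for $t \ge 0$, see (5.13),

$$\widehat{\beta}^{(t)} \mapsto \widehat{\beta}^{(t+1)} = \widehat{\beta}^{(t)} + \mathcal{J}(\widehat{\beta}^{(t)})^{-1} s(\widehat{\beta}^{(t)}, Y). \tag{6.10}$$


Lemma 6.4 Fisher's scoring update (6.10) can be rewritten as follows

$$\widehat{\beta}^{(t)} \mapsto \widehat{\beta}^{(t+1)} = \mathcal{J}(\widehat{\beta}^{(t)})^{-1} \mathfrak{X}^\top W(\widehat{\beta}^{(t)}) \left( \mathfrak{X} \widehat{\beta}^{(t)} + R(Y, \widehat{\beta}^{(t)}) \right).$$


Proof A straightforward calculation shows

$$\begin{aligned}
\widehat{\beta}^{(t+1)} &= \widehat{\beta}^{(t)} + \mathcal{J}(\widehat{\beta}^{(t)})^{-1} s(\widehat{\beta}^{(t)}, Y) \\
&= \mathcal{J}(\widehat{\beta}^{(t)})^{-1} \left( \mathcal{J}(\widehat{\beta}^{(t)}) \widehat{\beta}^{(t)} + \mathfrak{X}^\top W(\widehat{\beta}^{(t)}) R(Y, \widehat{\beta}^{(t)}) - 2n\lambda \widehat{\beta}^{(t)}_- \right) \\
&= \mathcal{J}(\widehat{\beta}^{(t)})^{-1} \left( \mathcal{I}(\widehat{\beta}^{(t)}) \widehat{\beta}^{(t)} + \mathfrak{X}^\top W(\widehat{\beta}^{(t)}) R(Y, \widehat{\beta}^{(t)}) \right) \\
&= \mathcal{J}(\widehat{\beta}^{(t)})^{-1} \mathfrak{X}^\top W(\widehat{\beta}^{(t)}) \left( \mathfrak{X} \widehat{\beta}^{(t)} + R(Y, \widehat{\beta}^{(t)}) \right).
\end{aligned}$$

This proves the claim. □


Lemma 6.4 allows us to fit a ridge regularized GLM. To determine an optimal
regularization parameter $\lambda \ge 0$ one uses cross-validation, in particular, generalized
cross-validation is used to receive an efficient cross-validation method, see (5.67).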
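
The update of Lemma 6.4 can be sketched in a few lines. The example below is an
illustration under explicit assumptions and is not the implementation used for the
examples of this chapter: a gamma GLM with log-link, unit weights $v_i = 1$ and
$\varphi = 1$, for which $W(\beta)$ is the identity matrix and the working residuals are
$R_i = Y_i/\mu_i - 1$ (compare the gradient formula in Example 6.8). The helper names
`solve`, `ridge_fisher_scoring` and `score`, as well as the toy data, are hypothetical.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def ridge_fisher_scoring(X, Y, lam, iters=200):
    """Iterate beta -> J^{-1} X^T W (X beta + R) with W = identity (see lead-in)."""
    n, d = len(X), len(X[0])
    beta = [math.log(sum(Y) / n)] + [0.0] * (d - 1)  # start in the null model
    # J = X^T X + 2 n lam diag(0, 1, ..., 1); constant in beta because W = identity
    J = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(d)] for a in range(d)]
    for a in range(1, d):
        J[a][a] += 2.0 * n * lam
    for _ in range(iters):
        eta = [sum(X[i][a] * beta[a] for a in range(d)) for i in range(n)]
        z = [eta[i] + Y[i] / math.exp(eta[i]) - 1.0 for i in range(n)]  # X beta + R
        rhs = [sum(X[i][a] * z[i] for i in range(n)) for a in range(d)]
        beta = solve(J, rhs)
    return beta

def score(X, Y, beta, lam):
    """Score (6.9): X^T R - 2 n lam beta_-, with no penalty on the intercept."""
    n, d = len(X), len(X[0])
    mu = [math.exp(sum(X[i][a] * beta[a] for a in range(d))) for i in range(n)]
    s = [sum(X[i][a] * (Y[i] / mu[i] - 1.0) for i in range(n)) for a in range(d)]
    return [s[a] - (2.0 * n * lam * beta[a] if a > 0 else 0.0) for a in range(d)]

# toy data (hypothetical): intercept plus one centered feature
x = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]
noise = [0.9, 1.1, 1.05, 0.95, 1.2, 0.8, 1.0, 1.1]
X = [[1.0, xi] for xi in x]
Y = [math.exp(0.5 + 0.3 * xi) * e for xi, e in zip(x, noise)]

beta_mle = ridge_fisher_scoring(X, Y, lam=0.0)
beta_ridge = ridge_fisher_scoring(X, Y, lam=1.0)
```

At convergence both estimates solve the score equations (6.9) for their respective
$\lambda$, and the ridge slope is closer to zero than the MLE slope.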


Example 6.5 (Ridge Regression) We revisit the gamma claim size example of
Sect. 5.3.7, and we choose model Gamma GLM1, see Listing 5.11. This example
does not consider any categorical features, but only continuous ones. We directly
apply Fisher's scoring updates (6.10).³ For this analysis we center and normalize
(to unit variance) the columns of the design matrix $\mathfrak{X}$ (except for the initial
column of $\mathfrak{X}$ encoding the intercept).

³ The R command glmnet [142] allows for regularized MLE, however, the current version does
not include the gamma distribution. Therefore, we have implemented our own routine.

Fig. 6.3 Ridge regularized MLEs in model Gamma GLM1: (lhs) in-sample deviance losses as a
function of the regularization parameter $\lambda \ge 0$, (rhs) resulting
$\widehat{\beta}_j^{\rm ridge}(\lambda)$ for $1 \le j \le q = 8$

Figure 6.3 (lhs) shows the resulting in-sample deviance losses as a function of
$\lambda \ge 0$. Regularization parameter $\lambda$ allows us to continuously connect the
in-sample deviance losses of the null model (2.085) and model Gamma GLM1 (1.717), see
Table 5.13. Figure 6.3 (rhs) shows the regression parameter estimates
$\widehat{\beta}_j^{\rm ridge}(\lambda)$, $1 \le j \le q = 8$, as a function of
$\lambda \ge 0$. Overall they decrease because the budget constraint gets tighter for
increasing $\lambda$; however, the individual parameters do not need to be monotone, since
one parameter may (better) compensate a decrease of another (through correlations in
feature components).

Finally, we need to choose the optimal regularization parameter $\lambda \ge 0$.
This is done by cross-validation. We exploit the generalized cross-validation loss,
see (5.67), and the hat matrix in this ridge regularized case is given by

$$H_\lambda = W(\widehat{\beta}^{\rm ridge}_\lambda)^{1/2}\, \mathfrak{X}\, \mathcal{J}(\widehat{\beta}^{\rm ridge}_\lambda)^{-1}\, \mathfrak{X}^\top\, W(\widehat{\beta}^{\rm ridge}_\lambda)^{1/2}.$$

In contrast to (5.66), this hat matrix $H_\lambda$ is not a projection, but we would need
to work in an augmented model to receive the projection property (accounting for the
regularization part).

Fig. 6.4 Generalized cross-validation loss $\widehat{\mathfrak{D}}^{\rm GCV}(\lambda)$ as a
function of $\lambda > 0$ (axes: log(lambda) against GCV losses)

Figure 6.4 plots the generalized cross-validation loss as a function of $\lambda > 0$.
We observe the minimum in parameter $\lambda = e^{-9.4}$. The resulting generalized
cross-validation loss is 1.76742. This is bigger than the one received in model Gamma
GLM2, see Table 5.16; thus, we still prefer model Gamma GLM2 over the optimally
ridge regularized model GLM1. Note that for model Gamma GLM2 we did variable
selection, whereas ridge regression just generally shrinks regression parameters.
For more interpretation we refer to Example 6.8, below, which considers LASSO
regularization. □


6.2.4 LASSO Regularization 


In this section we consider LASSO regularization (p = 1). This is more chal- 
lenging than ridge regularization because of the non-differentiability of the budget 
constraint, see Fig.6.1 (rhs). This section follows Chapters 2 and 5 of Hastie et 
al. [184] and Parikh—Boyd [292]. 


Gaussian Case 


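We start with the homoskedastic Gaussian model having unit variance $\sigma^2 = 1$. In a
first step, the regression model only involves one feature component $q = 1$. Thus,
we aim at solving LASSO optimization

$$\widehat{\beta}^{\rm LASSO} = \underset{\beta \in \mathbb{R}^2}{\arg\max}\ -\frac{1}{2n} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2 - \lambda |\beta_1|.$$

We standardize the observations and features $(Y_i, x_i)_{1 \le i \le n}$ such that we have
$\sum_{i=1}^n Y_i = 0$, $\sum_{i=1}^n x_i = 0$ and $n^{-1} \sum_{i=1}^n x_i^2 = 1$. This
implies that we can omit the intercept parameter $\beta_0$, as the optimal intercept
satisfies for this standardized data (and any $\beta_1 \in \mathbb{R}$)

$$\widehat{\beta}_0 = \frac{1}{n} \sum_{i=1}^n (Y_i - \beta_1 x_i) = 0. \tag{6.11}$$

Thus, w.l.o.g., we assume to work with standardized data in this section; this gives
us the optimization problem (we drop the lower index in $\beta_1$ because we only have
one component)

$$\widehat{\beta}^{\rm LASSO} = \widehat{\beta}^{\rm LASSO}(\lambda) = \underset{\beta \in \mathbb{R}}{\arg\max}\ -\frac{1}{2n} \sum_{i=1}^n (Y_i - \beta x_i)^2 - \lambda |\beta|. \tag{6.12}$$

The difficulty is that the regularization term is not differentiable in zero. Since this
term is convex we can express its derivative in terms of a sub-gradient $s$. This
provides score

$$\frac{\partial}{\partial \beta} \left( -\frac{1}{2n} \sum_{i=1}^n (Y_i - \beta x_i)^2 - \lambda |\beta| \right) = \frac{1}{n} \sum_{i=1}^n (Y_i - \beta x_i)\, x_i - \lambda s = n^{-1} \langle Y, x \rangle - \beta - \lambda s,$$

where we use standardization $n^{-1} \sum_{i=1}^n x_i^2 = 1$ in the second step,
$\langle Y, x \rangle$ is the scalar product of $Y, x = (x_1, \ldots, x_n)^\top \in
\mathbb{R}^n$, and where we consider the sub-gradient

$$s = s(\beta) = \begin{cases} +1 & \text{if } \beta > 0, \\ -1 & \text{if } \beta < 0, \\ \in [-1, 1] & \text{otherwise.} \end{cases}$$

Hence, we receive the score equation for $\beta \neq 0$

$$n^{-1} \langle Y, x \rangle - \beta - \lambda s = n^{-1} \langle Y, x \rangle - \beta - \operatorname{sign}(\beta) \lambda = 0.$$

This score equation has a proper solution $\widehat{\beta} > 0$ if
$n^{-1} \langle Y, x \rangle > \lambda$, and it has a proper solution $\widehat{\beta} < 0$
if $n^{-1} \langle Y, x \rangle < -\lambda$. In any other case we have a boundary
solution $\widehat{\beta} = 0$ for our maximization problem (6.12).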


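This solution can be written in terms of the following soft-thresholding
operator for $\lambda \ge 0$

$$\widehat{\beta}^{\rm LASSO} = S_\lambda\left( n^{-1} \langle Y, x \rangle \right) \qquad \text{with} \qquad S_\lambda(x) = \operatorname{sign}(x) (|x| - \lambda)_+. \tag{6.13}$$

This soft-thresholding operator is illustrated in Fig. 6.5 for $\lambda = 4$.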
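
As a small illustration of (6.12) and (6.13), the sketch below implements the
soft-thresholding operator and the resulting one-dimensional LASSO estimator. The
function names and the toy data (already standardized so that the $Y_i$ and $x_i$ sum to
zero and $n^{-1} \sum_i x_i^2 = 1$) are hypothetical choices made for this example.

```python
def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0), see (6.13)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_objective(beta, Y, x, lam):
    """Objective of the maximization problem (6.12)."""
    n = len(Y)
    return -sum((yi - beta * xi) ** 2 for yi, xi in zip(Y, x)) / (2 * n) - lam * abs(beta)

def lasso_1d(Y, x, lam):
    """Closed-form maximizer of (6.12) for standardized data."""
    n = len(Y)
    return soft_threshold(sum(yi * xi for yi, xi in zip(Y, x)) / n, lam)

# standardized toy data: n^{-1} <Y, x> = 1
x = [-1.0, 1.0, -1.0, 1.0]
Y = [-1.5, 0.5, -0.5, 1.5]

beta_small_lam = lasso_1d(Y, x, lam=0.25)   # shrunk towards zero by lam
beta_large_lam = lasso_1d(Y, x, lam=1.5)    # boundary solution, exactly zero
```

For the small regularization parameter the estimate is the unpenalized solution
$n^{-1}\langle Y, x\rangle$ moved towards zero by $\lambda$; for the large one the
boundary solution applies.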
Fig. 6.5 Soft-thresholding operator $x \mapsto S_\lambda(x)$ for $\lambda = 4$ (red dotted
lines)

This approach can be generalized to multiple feature components $x \in \mathbb{R}^q$.
We standardize the observations and features $\sum_{i=1}^n Y_i = 0$,
$\sum_{i=1}^n x_{i,j} = 0$ and $n^{-1} \sum_{i=1}^n x_{i,j}^2 = 1$ for all
$1 \le j \le q$. This allows us again to drop the intercept term and to directly consider

$$\widehat{\beta}^{\rm LASSO} = \widehat{\beta}^{\rm LASSO}(\lambda) = \underset{\beta \in \mathbb{R}^q}{\arg\max}\ -\frac{1}{2n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^q \beta_j x_{i,j} \Big)^2 - \lambda \|\beta\|_1.$$

Since this is a concave (quadratic) maximization problem with a separable (convex)
penalty term, we can apply a cyclic coordinate descent method that iterates a cyclic
coordinate-wise maximization until convergence. Thus, if we want to maximize
in the $t$-th iteration the $j$-th coordinate of the regression parameter we consider
recursively

$$\beta_j^{(t)} = \underset{\beta_j \in \mathbb{R}}{\arg\max}\ -\frac{1}{2n} \sum_{i=1}^n \Big( Y_i - \sum_{l=1}^{j-1} \beta_l^{(t)} x_{i,l} - \beta_j x_{i,j} - \sum_{l=j+1}^{q} \beta_l^{(t-1)} x_{i,l} \Big)^2 - \lambda |\beta_j|.$$

Using the soft-thresholding operator (6.13) we find the optimal solution

$$\beta_j^{(t)} = S_\lambda\bigg( n^{-1} \Big\langle Y - \sum_{l=1}^{j-1} \beta_l^{(t)} x_l - \sum_{l=j+1}^{q} \beta_l^{(t-1)} x_l,\ x_j \Big\rangle \bigg),$$

with vectors $x_l = (x_{1,l}, \ldots, x_{n,l})^\top \in \mathbb{R}^n$ for $1 \le l \le q$.
Iteration until convergence provides the LASSO regularized estimator
$\widehat{\beta}^{\rm LASSO}(\lambda)$ for given regularization parameter $\lambda > 0$.

Typically, we want to explore $\widehat{\beta}^{\rm LASSO}(\lambda)$ for multiple
$\lambda$'s. For this, one runs a pathwise cyclic coordinate descent method. We start with
a large value for $\lambda$, namely, we define

$$\lambda^{\max} = \max_{1 \le j \le q} \left| n^{-1} \langle Y, x_j \rangle \right|.$$

For $\lambda \ge \lambda^{\max}$, we have $\widehat{\beta}^{\rm LASSO}(\lambda) = 0$, i.e.,
we have the null model. Pathwise cyclic coordinate descent starts with this solution for
$\lambda_0 = \lambda^{\max}$. In a next step, one slightly decreases $\lambda_0$ and runs
the cyclic coordinate descent algorithm until convergence for this slightly smaller
$\lambda_1 < \lambda_0$, and with starting value $\widehat{\beta}^{\rm LASSO}(\lambda_0)$.
This is then iterated for $\lambda_{t+1} < \lambda_t$, $t \ge 0$, which provides a sequence
of LASSO regularized estimators $\widehat{\beta}^{\rm LASSO}(\lambda_t)$ along the path
$(\lambda_t)_{t \ge 0}$.
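
The cyclic coordinate descent update and its pathwise version with warm starts can be
sketched as follows. This is a minimal illustration, not an optimized implementation;
the function names, the equidistant grid of five $\lambda$ values and the toy data are
assumptions made for this example only.

```python
import math

def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0), see (6.13)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def standardize(col):
    """Center a feature column and scale it to n^{-1} sum x^2 = 1."""
    n = len(col)
    m = sum(col) / n
    centered = [c - m for c in col]
    s = math.sqrt(sum(c * c for c in centered) / n)
    return [c / s for c in centered]

def coordinate_descent(cols, Y, lam, beta, sweeps=1000, tol=1e-12):
    """Cyclic coordinate descent; cols[j] is the standardized vector x_j."""
    n, q = len(Y), len(cols)
    beta = list(beta)
    for _ in range(sweeps):
        change = 0.0
        for j in range(q):
            # partial residual that leaves out the j-th coordinate
            r = [Y[i] - sum(beta[l] * cols[l][i] for l in range(q) if l != j)
                 for i in range(n)]
            new = soft_threshold(sum(ri * xi for ri, xi in zip(r, cols[j])) / n, lam)
            change = max(change, abs(new - beta[j]))
            beta[j] = new
        if change < tol:
            break
    return beta

def lasso_path(cols, Y, num=5):
    """Pathwise version: start in the null model at lam_max and use warm starts."""
    n = len(Y)
    lam_max = max(abs(sum(yi * xi for yi, xi in zip(Y, c))) / n for c in cols)
    path, beta = [], [0.0] * len(cols)
    for k in range(num):
        lam = lam_max * (1.0 - k / num)
        beta = coordinate_descent(cols, Y, lam, beta)  # warm start from previous lam
        path.append((lam, beta))
    return path

# toy data (hypothetical): two correlated features
cols = [standardize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
        standardize([2.0, 1.0, 4.0, 3.0, 6.0, 8.0])]
Y_raw = [1.0, 1.5, 2.9, 3.1, 5.2, 6.3]
Yc = [yi - sum(Y_raw) / len(Y_raw) for yi in Y_raw]  # centered observations

path = lasso_path(cols, Yc)
```

Each entry of `path` is a pair of a regularization parameter and the corresponding
estimate; the first entry belongs to $\lambda^{\max}$ and is the null model.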

For further remarks we refer to Section 2.6 in Hastie et al. [184]. This concerns 
statements about uniqueness for general design matrices, also in the set-up where 
q > n,i.e., where we have more parameters than observations. Moreover, references 
to convergence results are given in Section 2.7 of Hastie et al. [184]. This closes the 
Gaussian case. 


Gradient Descent Algorithm for LASSO Regularization 


In Sect. 7.2.3 we will discuss gradient descent methods for network fitting. In this 
section we provide preliminary considerations on gradient descent methods because 
these are also useful to fit LASSO regularized parameters within GLMs (different 
from Gaussian GLMs). Remark that we do a sign switch in what follows, and we 
aim at minimizing an objective function g. 

Choose a convex and differentiable function $g : \mathbb{R}^{q+1} \to \mathbb{R}$. Assuming
that the global minimum of $g$ is achieved, a necessary and sufficient condition for the
optimality of $\beta^* \in \mathbb{R}^{q+1}$ in this convex setting is
$\nabla_\beta g(\beta)|_{\beta = \beta^*} = 0$. Gradient descent algorithms find this
optimal point by iterating for $t \ge 0$

$$\beta^{(t)} \mapsto \beta^{(t+1)} = \beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)}), \tag{6.14}$$

for tempered learning rates $\varrho_{t+1} > 0$. This algorithm is motivated by a first
order Taylor expansion that determines the direction of the maximal local decrease of the
objective function $g$ supposed we are in position $\beta$, i.e.,

$$g(\tilde\beta) = g(\beta) + \nabla_\beta g(\beta)^\top (\tilde\beta - \beta) + o\left( \|\tilde\beta - \beta\|_2 \right) \qquad \text{as } \|\tilde\beta - \beta\|_2 \to 0.$$

The gradient descent algorithm (6.14) leads to the (unconstrained) minimum of the
objective function $g$ at convergence. A budget constraint like (6.6) leads to a convex
constraint $\beta \in C \subset \mathbb{R}^{q+1}$. Consideration of such a convex
constraint requires that we reformulate the gradient descent algorithm (6.14). The
gradient descent step (6.14) can also be found, for given learning rate $\varrho_{t+1}$,
by solving the following linearized problem for $g$ with the Euclidean square distance
penalty term (ridge regularization) for too big gradient descent steps

$$\underset{\beta \in \mathbb{R}^{q+1}}{\arg\min} \left\{ g(\beta^{(t)}) + \nabla_\beta g(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \frac{1}{2\varrho_{t+1}} \|\beta - \beta^{(t)}\|_2^2 \right\}. \tag{6.15}$$

The solution to this optimization problem exactly gives the gradient descent
step (6.14). This is now adapted to a constrained gradient descent update for convex
constraint $C$:

$$\beta^{(t+1)} = \underset{\beta \in C}{\arg\min} \left\{ g(\beta^{(t)}) + \nabla_\beta g(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \frac{1}{2\varrho_{t+1}} \|\beta - \beta^{(t)}\|_2^2 \right\}. \tag{6.16}$$

The solution to this constrained convex optimization problem is obtained by, first,
taking an unconstrained gradient descent step $\beta^{(t)} \mapsto \beta^{(t)} -
\varrho_{t+1} \nabla_\beta g(\beta^{(t)})$, and, second, if this step is not within the
convex set $C$, it is projected back to $C$; this is illustrated in Fig. 6.6, and it is
called projected gradient descent step (justification is given in Lemma 6.6 below).
Thus, the only difficulty in applying this projected gradient descent step is to find an
efficient method of projecting the unconstrained solution (6.14)-(6.15) back to the
convex constraint set $C$.

Fig. 6.6 Projected gradient descent step: first, mapping $\beta^{(t)}$ to the
unconstrained solution $\beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)})$ of (6.15)
and, second, projecting this unconstrained solution back to the convex set $C$ giving
$\beta^{(t+1)}$; see also Figure 5.5 in Hastie et al. [184]
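
A minimal sketch of the projected gradient descent step is given below for a case in
which the projection is explicit. The setting is an assumption chosen for illustration:
the convex set is a Euclidean ball $C = \{\beta;\ \|\beta\|_2^2 \le c\}$ (without a
separate intercept), the objective is $g(\beta) = \|\beta - b\|_2^2 / 2$ for a
hypothetical target vector `b`, and the learning rate is fixed.

```python
import math

def project_onto_ball(z, c):
    """Euclidean projection onto C = {beta : ||beta||_2^2 <= c}."""
    norm = math.sqrt(sum(zj * zj for zj in z))
    radius = math.sqrt(c)
    if norm <= radius:
        return list(z)          # already feasible, nothing to do
    return [zj * radius / norm for zj in z]

def projected_gradient_descent(grad, beta, c, rate=0.1, steps=500):
    for _ in range(steps):
        g = grad(beta)
        unconstrained = [bj - rate * gj for bj, gj in zip(beta, g)]  # step (6.14)
        beta = project_onto_ball(unconstrained, c)                   # back to C
    return beta

# g(beta) = ||beta - b||_2^2 / 2 has its unconstrained minimum in b, outside of C
b = [3.0, 4.0]

def grad(beta):
    return [bj - tj for bj, tj in zip(beta, b)]

beta_hat = projected_gradient_descent(grad, [0.0, 0.0], c=1.0)
```

Since the unconstrained minimum `b` lies outside the unit ball, the budget constraint
is binding and the iteration settles at the boundary point closest to `b`, which is
`b` rescaled to unit length.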

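Assume that the convex constraint set $C$ is expressed by a convex function
$h$ (not necessarily being differentiable). To solve (6.16) and to motivate the
projected gradient descent step, we use the proximal gradient method discussed in
Section 5.3.3 of Hastie et al. [184]. The proximal gradient method helps us to do
the projection in the projected gradient descent step. We introduce the generalized
projection operator, for $z \in \mathbb{R}^{q+1}$

$$\operatorname{prox}_h(z) = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\min} \left\{ \frac{1}{2} \|z - \beta\|_2^2 + h(\beta) \right\}. \tag{6.17}$$

This generalized projection operator should be interpreted as a square minimization
problem $\|z - \beta\|_2^2 / 2$ on a convex set $C$ being expressed by its dual Lagrangian
formulation described by the regularization term $h(\beta)$. The following lemma shows
that the generalized projection operator solves the Lagrangian form of (6.16).


Lemma 6.6 Assume the convex constraint $C$ is expressed by the convex function $h$.
The generalized projection operator solves

$$\begin{aligned}
\beta^{(t+1)} &= \operatorname{prox}_{\varrho_{t+1} h}\left( \beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)}) \right) \qquad\qquad (6.18) \\
&= \underset{\beta \in \mathbb{R}^{q+1}}{\arg\min} \left\{ g(\beta^{(t)}) + \nabla_\beta g(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \frac{1}{2\varrho_{t+1}} \|\beta - \beta^{(t)}\|_2^2 + h(\beta) \right\}.
\end{aligned}$$


Proof of Lemma 6.6 It suffices to consider the following calculation

$$\begin{aligned}
&\frac{1}{2} \left\| \beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)}) - \beta \right\|_2^2 + \varrho_{t+1} h(\beta) \\
&\quad = \frac{\varrho_{t+1}^2}{2} \left\| \nabla_\beta g(\beta^{(t)}) \right\|_2^2 - \varrho_{t+1} \left\langle \nabla_\beta g(\beta^{(t)}), \beta^{(t)} - \beta \right\rangle + \frac{1}{2} \left\| \beta^{(t)} - \beta \right\|_2^2 + \varrho_{t+1} h(\beta) \\
&\quad = \frac{\varrho_{t+1}^2}{2} \left\| \nabla_\beta g(\beta^{(t)}) \right\|_2^2 + \varrho_{t+1} \left( \nabla_\beta g(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \frac{1}{2\varrho_{t+1}} \left\| \beta - \beta^{(t)} \right\|_2^2 + h(\beta) \right).
\end{aligned}$$

This is exactly the right objective function (in the round brackets) if we ignore all
terms that are independent of $\beta$. This proves the lemma. □


Thus, to solve the constrained optimization problem (6.16) we bring it into its dual
Lagrangian form (6.18). Then we apply the generalized projection operator to the
unconstrained solution to find the constrained solution, see Lemma 6.6. This approach
will be successful if we can explicitly compute the generalized projection operator
$\operatorname{prox}_h(\cdot)$.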


Lemma 6.7 The generalized projection operator (6.17) satisfies for LASSO
constraint $h(\beta) = \lambda \|\beta_-\|_1$

$$\operatorname{prox}_h(z) = S^{\rm LASSO}_\lambda(z) \overset{\rm def.}{=} \Big( z_0,\ \operatorname{sign}(z_1)(|z_1| - \lambda)_+,\ \ldots,\ \operatorname{sign}(z_q)(|z_q| - \lambda)_+ \Big)^\top,$$

for $z \in \mathbb{R}^{q+1}$.

Proof of Lemma 6.7 We need to solve for function $\beta \mapsto h(\beta) = \lambda \|\beta_-\|_1$

$$\operatorname{prox}_h(z) = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\min} \left\{ \frac{1}{2} \|z - \beta\|_2^2 + \lambda \|\beta_-\|_1 \right\} = \underset{\beta \in \mathbb{R}^{q+1}}{\arg\min} \left\{ \frac{1}{2} \sum_{j=0}^q (z_j - \beta_j)^2 + \lambda \sum_{j=1}^q |\beta_j| \right\}.$$

This decouples into $q + 1$ independent optimization problems. The first one is
solved by $\beta_0 = z_0$ and the remaining ones are solved by the soft-thresholding
operator (6.13). This finishes the proof. □


We conclude that the constrained optimization problem (6.16) for the (convex)
LASSO constraint $C = \{\beta;\ \|\beta_-\|_1 \le c\}$ is brought into its dual Lagrangian
form (6.18) of Lemma 6.6 with $h(\beta) = \lambda \|\beta_-\|_1$ for suitable
$\lambda = \lambda(c)$. The LASSO regularized parameter estimation is then solved by first
performing an unconstrained gradient descent step $\beta^{(t)} \mapsto \beta^{(t)} -
\varrho_{t+1} \nabla_\beta g(\beta^{(t)})$, and this updated parameter is projected back
to $C$ using the generalized projection operator of Lemma 6.7 with
$h(\beta) = \varrho_{t+1} \lambda \|\beta_-\|_1$.


Proximal gradient descent algorithm for LASSO


1. Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$\beta^{(t)} \mapsto \tilde\beta^{(t+1)} = \beta^{(t)} - \varrho_{t+1} \nabla_\beta g(\beta^{(t)}).$$

2. Perform soft-thresholding of the gradient descent solution

$$\tilde\beta^{(t+1)} \mapsto \beta^{(t+1)} = S^{\rm LASSO}_{\varrho_{t+1} \lambda}\left( \tilde\beta^{(t+1)} \right),$$

where the latter soft-thresholding function is defined in Lemma 6.7.
3. Iterate these two steps until a stopping criterion is met.


If the gradient $\nabla_\beta g(\cdot)$ is Lipschitz continuous with Lipschitz constant
$L > 0$, the proximal gradient descent algorithm will converge at rate $O(1/t)$ for a
fixed step size $0 < \varrho = \varrho_{t+1} \le 1/L$, see Section 4.2 in Parikh–Boyd [292].


Example 6.8 (LASSO Regression) We revisit Example 6.5 which considers claim
size modeling using model Gamma GLM1. In order to apply the proximal gradient
descent algorithm for LASSO regularization we need to calculate the gradient of
the negative log-likelihood. In the gamma case with log-link, it is given by, see
Example 5.5,

$$-\nabla_\beta \ell_Y(\beta) = -\mathfrak{X}^\top W(\beta) R(Y, \beta) = -\mathfrak{X}^\top \operatorname{diag}\left( \frac{n_1}{\varphi}, \ldots, \frac{n_m}{\varphi} \right) \left( \frac{Y_1}{\mu_1} - 1, \ldots, \frac{Y_m}{\mu_m} - 1 \right)^\top,$$

where $m \in \mathbb{N}$ is the number of policies with claims, and
$\mu_i = \mu_i(\beta) = \exp\langle \beta, x_i \rangle$. We set $\varphi = 1$ as this
constant can be integrated into the learning rates $\varrho_{t+1}$.

Fig. 6.7 LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function
of the regularization parameter $\lambda > 0$, (rhs) resulting
$\widehat{\beta}_j^{\rm LASSO}(\lambda)$ for $1 \le j \le q$. Panel titles: LASSO
regularization: in-sample losses, LASSO regularization: regression parameters; x-axes:
log(lambda); legend (lhs): LASSO regularized, Gamma GLM1, gamma null
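
The two steps of the proximal gradient descent algorithm can be sketched for the gamma
case with log-link. The example below is an illustration only and not the routine used
for Fig. 6.7: it assumes $n_i = 1$ and $\varphi = 1$, it scales the negative
log-likelihood by $1/n$ as in (6.4), and it uses a fixed learning rate; the function
names and the toy data are hypothetical.

```python
import math

def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0), see (6.13)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def gradient(X, Y, beta):
    """Gradient of g(beta) = -(1/n) l_Y(beta): -(1/n) X^T (Y/mu - 1), mu = exp<beta, x>."""
    n, d = len(X), len(X[0])
    mu = [math.exp(sum(X[i][a] * beta[a] for a in range(d))) for i in range(n)]
    return [-sum(X[i][a] * (Y[i] / mu[i] - 1.0) for i in range(n)) / n for a in range(d)]

def proximal_gradient_lasso(X, Y, lam, rate=0.25, steps=5000):
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(X, Y, beta)
        step = [b - rate * g for b, g in zip(beta, grad)]           # step 1
        # step 2: soft-thresholding, the intercept step[0] is left untouched
        beta = [step[0]] + [soft_threshold(s, rate * lam) for s in step[1:]]
    return beta

# toy data (hypothetical): intercept column and two centered features
x1 = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
noise = [0.9, 1.1, 1.05, 0.95, 1.2, 0.8, 1.0, 1.1]
X = [[1.0, a, b] for a, b in zip(x1, x2)]
Y = [math.exp(0.5 + 0.3 * a) * e for a, e in zip(x1, noise)]

beta_null = proximal_gradient_lasso(X, Y, lam=10.0)    # strong regularization
beta_lasso = proximal_gradient_lasso(X, Y, lam=0.15)   # moderate regularization
```

For the strong regularization all slopes are exactly zero and the intercept equals the
logarithm of the empirical mean, which is the balance property of the unregularized
intercept. For the moderate value, the first feature stays in the model and the second
one is eliminated.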

We have implemented the proximal gradient descent algorithm ourselves using 
an equidistant grid for the regularization parameter à > 0, a fixed learning rate 
Qr+1 = 0.05 and normalized features. Since this has been done rather brute force, 
the results presented in Fig. 6.7 look a bit wiggly. These results should be compared 
to Fig. 6.3. We see that, in contrast to ridge regularization, less important regression 
parameters are shrunk exactly to zero in LASSO regularization. We give the order 
in which the parameters are shrunk to zero: 6; (OwnerAge), 64 (RiskClass), 
Be (VehAge’), Bs (BonusClass), 67 (GenderMale), f2 (OwnerAge?’), B3 
(AreaGLM) and fs (VehAge). In view of Listing 5.11 this order seems a bit 
surprising. The reason for this surprising order is that we have grouped features 
here, and, obviously, these should be considered jointly. In particular, we first drop 
OwnerAge because this can also be partially explained by OwnerAge’, therefore, 
we should not treat these two variables individually. Having this weakness in mind 
supports the conclusions drawn from the Wald tests in Listing 5.11, and we come 
back to this in Example 6.10, below. 

■

Oracle Property


An interesting question is whether the chosen regularization fulfills the so-called oracle property. For simplicity, we assume to work in the normalized Gaussian case that allows us to exclude the intercept $\beta_0$, see (6.11). Thus, we work with a regression parameter $\beta \in \mathbb{R}^q$. Assume that there is a true data model that can be described by the (true) regression parameter $\beta^* \in \mathbb{R}^q$. Denote by $\mathcal{A}^* = \{j \in \{1,\ldots,q\};\ \beta^*_j \neq 0\}$ the set of feature components of $x \in \mathbb{R}^q$ that determine the true regression function, and we assume $|\mathcal{A}^*| < q$. Denote by $\hat{\beta}_n(\lambda)$ the parameter estimate that has been received by the regularized MAP estimation for a given regularization parameter $\lambda > 0$ and based on i.i.d. data of sample size $n$. We say that $(\hat{\beta}_n(\lambda_n))_{n \in \mathbb{N}}$ fulfills the oracle property if there exists a sequence $(\lambda_n)_{n \in \mathbb{N}}$ of regularization parameters $\lambda_n > 0$ such that

$$\lim_{n \to \infty} \mathbb{P}\left[\hat{\mathcal{A}}_n = \mathcal{A}^*\right] = 1, \tag{6.19}$$

$$\sqrt{n}\left(\hat{\beta}_{n,\mathcal{A}^*}(\lambda_n) - \beta^*_{\mathcal{A}^*}\right) \Rightarrow \mathcal{N}\left(0, \mathcal{I}_{\mathcal{A}^*}^{-1}\right) \qquad \text{as } n \to \infty, \tag{6.20}$$

where $\hat{\mathcal{A}}_n = \{j \in \{1,\ldots,q\};\ (\hat{\beta}_n(\lambda_n))_j \neq 0\}$, $\beta_{\mathcal{A}}$ only considers the components in $\mathcal{A} \subset \{1,\ldots,q\}$, and $\mathcal{I}_{\mathcal{A}^*}$ is Fisher's information matrix on the true feature components. The first oracle property (6.19) tells us that asymptotically we choose the right feature components, and the second oracle property (6.20) tells us that we have asymptotic normality and, in particular, consistency on the right feature components.

Zou [408] states that LASSO regularization, in general, does not satisfy the oracle property. LASSO regularization can perform variable selection, however, as Zou [408] argues, there are situations where consistency is violated and, therefore, the oracle property cannot hold in general. Zou [408] therefore proposes an adaptive LASSO regularization method. Alternatively, Fan–Li [124] introduced smoothly clipped absolute deviation (SCAD) regularization which is a non-convex regularization that possesses the oracle property. SCAD regularization of $\beta$ is obtained by penalizing

$$J_\lambda(\beta) = \sum_{j=1}^q \left( \lambda |\beta_j| \, \mathbb{1}_{\{|\beta_j| \le \lambda\}} - \frac{|\beta_j|^2 - 2a\lambda|\beta_j| + \lambda^2}{2(a-1)} \, \mathbb{1}_{\{\lambda < |\beta_j| \le a\lambda\}} + \frac{(a+1)\lambda^2}{2} \, \mathbb{1}_{\{|\beta_j| > a\lambda\}} \right),$$

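for a hyperparameter $a > 2$. This function is continuous and differentiable except in $\beta_j = 0$ with partial derivatives for $\beta > 0$

$$\lambda \, \mathbb{1}_{\{\beta \le \lambda\}} + \frac{(a\lambda - \beta)_+}{a-1} \, \mathbb{1}_{\{\beta > \lambda\}}.$$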
Fig. 6.8 (lhs) LASSO soft-thresholding operator $x \mapsto \mathcal{S}_\lambda(x)$ for $\lambda = 4$ (red dotted lines), (rhs) SCAD thresholding operator $x \mapsto \mathcal{S}^{\rm SCAD}_\lambda(x)$ for $\lambda = 4$ and $a = 3$


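Thus, we have a constant LASSO-like slope $\lambda > 0$ for $0 < \beta \le \lambda$, shrinking some components exactly to zero. For $\beta > a\lambda$ the slope is 0, removing regularization, and it is concatenated between the two scenarios. The thresholding operator for SCAD regularization is given by, see Fan–Li [124],

$$\mathcal{S}^{\rm SCAD}_\lambda(x) = \begin{cases} \mathrm{sign}(x)\,(|x| - \lambda)_+ & \text{for } |x| \le 2\lambda, \\ \dfrac{(a-1)x - \mathrm{sign}(x)\, a\lambda}{a-2} & \text{for } 2\lambda < |x| \le a\lambda, \\ x & \text{for } |x| > a\lambda. \end{cases}$$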


Figure 6.8 compares the two thresholding operators of LASSO and SCAD. 
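The two thresholding operators can also be compared numerically. The following is a minimal sketch, assuming the standard forms of the LASSO soft-thresholding operator and of the SCAD thresholding operator of Fan–Li [124]; the function names `soft_threshold` and `scad_threshold` are our own and purely illustrative.

```python
import math


def soft_threshold(x, lam):
    """LASSO soft-thresholding: sign(x) * (|x| - lam)_+."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def scad_threshold(x, lam, a):
    """SCAD thresholding operator for hyperparameter a > 2."""
    if a <= 2:
        raise ValueError("SCAD requires a > 2")
    if abs(x) <= 2 * lam:
        # LASSO-like regime: small values are shrunk, possibly to zero
        return soft_threshold(x, lam)
    if abs(x) <= a * lam:
        # linear interpolation between the LASSO regime and the identity
        return ((a - 1) * x - math.copysign(a * lam, x)) / (a - 2)
    # large values are not regularized at all
    return x


if __name__ == "__main__":
    # same setting as in Fig. 6.8: lambda = 4 and a = 3
    for x in (-20, -10, -6, -3, 0, 3, 6, 10, 20):
        print(x, soft_threshold(x, 4.0), scad_threshold(x, 4.0, 3.0))
```

For large $|x|$ the LASSO operator keeps shrinking by $\lambda$, whereas the SCAD operator returns $x$ unchanged; this is the property that removes the bias on large regression parameters.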

Alternatively, we propose to do variable selection with LASSO regularization in a first step. Since the resulting LASSO regularized estimator may not be consistent, one should explore a second regression step where one uses an un-penalized regression model on the LASSO selected components; we also refer to Lee et al. [237].


6.2.5 Group LASSO Regularization 


In Example 6.8 we have seen that if there are natural groups within the feature components they should be treated simultaneously. Assume we have a group structure $x = (x_0, x_1, \ldots, x_K)$ with groups $x_k \in \mathbb{R}^{q_k}$ that should be treated simultaneously. This motivates the grouped penalties proposed by Yuan–Lin [398], see (6.5),


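$$\hat{\beta}^{\rm group} = \hat{\beta}^{\rm group}(\lambda) = \underset{\beta = (\beta_0, \beta_1, \ldots, \beta_K)}{\arg\max}\ \frac{1}{n}\,\ell_Y(\beta) - \lambda \sum_{k=1}^K \|\beta_k\|_2, \tag{6.21}$$

where we assume a group structure in the linear predictor providing

$$x \mapsto \eta(x) = \langle \beta, x \rangle = \beta_0 + \sum_{k=1}^K \langle \beta_k, x_k \rangle.$$

LASSO regularization is a special case of this grouped regularization, namely, if all groups $1 \le k \le K$ only contain one single component, i.e., $K = q$, we have $\hat{\beta}^{\rm group} = \hat{\beta}^{\rm LASSO}$.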

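The side constraint in (6.21) is convex, and the optimization problem (6.21) can again be solved by the proximal gradient descent algorithm. That is, in view of Lemma 6.6, the only difficulty is the calculation of the generalized projection operator for regularization term $h(\beta) = \lambda \sum_{k=1}^K \|\beta_k\|_2$. We therefore need to solve for $z = (z_0, z_1, \ldots, z_K)$, $z_k \in \mathbb{R}^{q_k}$,

$$\mathrm{prox}_h(z) = \underset{\beta = (\beta_0, \beta_1, \ldots, \beta_K)}{\arg\min} \left\{ \frac{1}{2} \|z - \beta\|_2^2 + \lambda \sum_{k=1}^K \|\beta_k\|_2 \right\} = \left( z_0, \left( \underset{\beta_k \in \mathbb{R}^{q_k}}{\arg\min} \left\{ \frac{1}{2} \|z_k - \beta_k\|_2^2 + \lambda \|\beta_k\|_2 \right\} \right)_{1 \le k \le K} \right).$$

The latter highlights that the problem decouples into $K$ independent problems. Thus, we need to solve for all $1 \le k \le K$ the optimization problems

$$\underset{\beta_k \in \mathbb{R}^{q_k}}{\arg\min} \left\{ \frac{1}{2} \|z_k - \beta_k\|_2^2 + \lambda \|\beta_k\|_2 \right\}.$$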
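Lemma 6.9 The group LASSO generalized soft-thresholding operator satisfies for $z_k \in \mathbb{R}^{q_k}$

$$\mathcal{S}^{q_k}_\lambda(z_k) = \underset{\beta_k \in \mathbb{R}^{q_k}}{\arg\min} \left\{ \frac{1}{2} \|z_k - \beta_k\|_2^2 + \lambda \|\beta_k\|_2 \right\} = z_k \left( 1 - \frac{\lambda}{\|z_k\|_2} \right)_+ \in \mathbb{R}^{q_k},$$

and for the generalized projection operator for $h(\beta) = \lambda \sum_{k=1}^K \|\beta_k\|_2$ we have

$$\mathrm{prox}_h(z) = \mathcal{S}^{\rm group}_\lambda(z) \overset{\rm def.}{=} \left( z_0, \mathcal{S}^{q_1}_\lambda(z_1), \ldots, \mathcal{S}^{q_K}_\lambda(z_K) \right),$$

for $z = (z_0, z_1, \ldots, z_K)$ with $z_k \in \mathbb{R}^{q_k}$.

Proof We prove this lemma. In a first step we have

$$\underset{\beta_k}{\arg\min} \left\{ \frac{1}{2} \|z_k - \beta_k\|_2^2 + \lambda \|\beta_k\|_2 \right\} = \underset{\beta_k = \varrho z_k / \|z_k\|_2,\ \varrho \ge 0}{\arg\min} \left\{ \frac{1}{2} \|z_k\|_2^2 \left( 1 - \frac{\varrho}{\|z_k\|_2} \right)^2 + \lambda \varrho \right\};$$

this follows because the square distance $\|z_k - \beta_k\|_2^2 = \|z_k\|_2^2 - 2\langle z_k, \beta_k \rangle + \|\beta_k\|_2^2$ is minimized if $z_k$ and $\beta_k$ point into the same direction. Thus, there remains the minimization of the objective function in $\varrho \ge 0$. The first derivative is given by

$$\frac{\partial}{\partial \varrho} \left( \frac{1}{2} \|z_k\|_2^2 \left( 1 - \frac{\varrho}{\|z_k\|_2} \right)^2 + \lambda \varrho \right) = -\|z_k\|_2 \left( 1 - \frac{\varrho}{\|z_k\|_2} \right) + \lambda = \lambda - \|z_k\|_2 + \varrho.$$

If $\|z_k\|_2 > \lambda$ we have $\varrho = \|z_k\|_2 - \lambda > 0$, and otherwise we need to set $\varrho = 0$. This implies

$$\mathcal{S}^{q_k}_\lambda(z_k) = \left( \|z_k\|_2 - \lambda \right)_+ z_k / \|z_k\|_2.$$

This completes the proof. □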
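The statement of Lemma 6.9 translates directly into code. The following is a minimal sketch, assuming that each group $z_k$ is stored as a plain list of floats; the function names are our own and not taken from any package.

```python
import math


def group_soft_threshold(z_k, lam):
    """Generalized soft-thresholding z_k * (1 - lam / ||z_k||_2)_+ of Lemma 6.9."""
    norm = math.sqrt(sum(v * v for v in z_k))
    if norm <= lam:
        # the whole group is set to zero simultaneously
        return [0.0] * len(z_k)
    scale = 1.0 - lam / norm
    return [scale * v for v in z_k]


def prox_group_lasso(z0, groups, lam):
    """Generalized projection operator: the intercept z0 is left untouched."""
    return z0, [group_soft_threshold(z_k, lam) for z_k in groups]


if __name__ == "__main__":
    # ||(3, 4)||_2 = 5, hence the group is scaled by 1 - 2.5/5 = 0.5
    print(prox_group_lasso(1.0, [[3.0, 4.0], [0.5]], lam=2.5))
```

For groups of size $q_k = 1$ this reduces to the usual LASSO soft-thresholding operator, in line with the remark that LASSO regularization is a special case of group LASSO regularization.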
Fig. 6.9 Group LASSO regularized MLEs in model Gamma GLM1: (lhs) in-sample losses as a function of the regularization parameter $\lambda > 0$, (rhs) resulting $\hat{\beta}_j^{\rm group}(\lambda)$ for $1 \le j \le q$


Proximal gradient descent algorithm for group LASSO

1. Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$\beta^{(t)} \mapsto \tilde{\beta}^{(t+1)} = \beta^{(t)} - \varrho_{t+1} \nabla_\beta\, g(\beta^{(t)}).$$

2. Perform soft-thresholding of the gradient descent solution

$$\tilde{\beta}^{(t+1)} \mapsto \beta^{(t+1)} = \mathcal{S}^{\rm group}_{\lambda \varrho_{t+1}}\left( \tilde{\beta}^{(t+1)} \right),$$

where the latter soft-thresholding function is defined in Lemma 6.9.
3. Iterate these two steps until a stopping criterion is met.
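The three steps above can be sketched as follows. This is a toy illustration under our own assumptions, not the implementation used for Fig. 6.9: we take a Gaussian model with unit variance, so that the smooth part is $g(\beta) = \frac{1}{2n}\sum_{i}(Y_i - \eta(x_i))^2$, we use a fixed learning rate, and the data and all function names are hypothetical.

```python
import math


def group_soft_threshold(z_k, lam):
    """Generalized soft-thresholding operator of Lemma 6.9."""
    norm = math.sqrt(sum(v * v for v in z_k))
    if norm <= lam:
        return [0.0] * len(z_k)
    return [(1.0 - lam / norm) * v for v in z_k]


def gradient(beta0, beta, X, y):
    """Gradient of g(beta) = (1/(2n)) * sum of squared residuals."""
    n = len(y)
    resid = [beta0 + sum(b * x for b, x in zip(beta, row)) - yi
             for row, yi in zip(X, y)]
    grad0 = sum(resid) / n
    grad = [sum(r * row[j] for r, row in zip(resid, X)) / n
            for j in range(len(beta))]
    return grad0, grad


def group_lasso_pgd(X, y, groups, lam, rate=0.5, steps=200):
    """Proximal gradient descent; groups is a list of index lists."""
    beta0, beta = 0.0, [0.0] * len(X[0])
    for _ in range(steps):
        # step 1: gradient descent step (the intercept is not penalized)
        grad0, grad = gradient(beta0, beta, X, y)
        beta0 -= rate * grad0
        tilde = [b - rate * g for b, g in zip(beta, grad)]
        # step 2: group-wise soft-thresholding with threshold lam * rate
        for idx in groups:
            shrunk = group_soft_threshold([tilde[j] for j in idx], lam * rate)
            for j, v in zip(idx, shrunk):
                tilde[j] = v
        beta = tilde
    return beta0, beta


# hypothetical orthogonal design with 8 observations and 3 features
X = [[(-1.0) ** i, (-1.0) ** (i // 2), (-1.0) ** (i // 4)] for i in range(8)]
y = [3.0 + 2.0 * row[0] + 1.0 * row[1] for row in X]
groups = [[0, 1], [2]]
beta0_hat, beta_hat = group_lasso_pgd(X, y, groups, lam=1.0)
```

In this orthogonal toy design the group (0, 1) is shrunk jointly towards zero along its direction, whereas the uninformative third feature is set exactly to zero; a fixed number of steps replaces a proper stopping criterion.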


Example 6.10 (Group LASSO Regression) We revisit Example 6.8 which considers claim size modeling using model Gamma GLM1. This time we group the variables OwnerAge and OwnerAge$^2$ $(\beta_1, \beta_2)$ as well as VehAge and VehAge$^2$ $(\beta_5, \beta_6)$. The results are shown in Fig. 6.9.

The order in which the parameters are regularized to zero is: $\beta_4$ (RiskClass), $\beta_8$ (BonusClass), $\beta_7$ (GenderMale), $(\beta_1, \beta_2)$ (OwnerAge, OwnerAge$^2$), $\beta_3$ (AreaGLM) and $(\beta_5, \beta_6)$ (VehAge, VehAge$^2$). This order now reflects more the variable importance as received from the Wald statistics of Listing 5.11, and it shows that grouped features should be regularized jointly in order to determine their importance. ■

6.3 Expectation-Maximization Algorithm


6.3.1 Mixture Distributions 


In many applied problems there does not exist a simple off-the-shelf distribution 
that is suitable to model the whole range of observations. We think of claim size modeling which may range from small to very large claims; the main body of the data may look, say, gamma distributed, while the tail of the data is regularly varying. Another related problem is that claims may come from different insurance
policy modules. For instance, in property insurance, one can insure water damage, 
fire, glass and theft claims on the same insurance policy, and feature information 
about the claim type may not always be available. In such cases, it looks attractive 
to choose a mixture or a composition of different distributions. In this section we 
focus on mixtures. 

Choose a fixed integer $K$ bigger than 1 and define the $(K-1)$-unit simplex excluding the edges by

$$\Delta_K = \left\{ p \in (0,1)^K;\ \sum_{k=1}^K p_k = 1 \right\}. \tag{6.22}$$

$\Delta_K$ defines the family of categorical distributions with $K$ levels (all levels having a strictly positive probability). These distributions belong to the vector-valued parameter EF which we have met in Sects. 2.1.4 and 5.7.

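The idea behind mixture distributions is to mix $K$ different distributions with a mixture probability $p \in \Delta_K$. For instance, we can mix $K$ different EDF densities $f_k$ by considering

$$Y \sim \sum_{k=1}^K p_k f_k(y; \theta_k, v/\varphi_k) = \sum_{k=1}^K p_k \exp\left\{ \frac{y\theta_k - \kappa_k(\theta_k)}{\varphi_k / v} + a_k(y; v/\varphi_k) \right\}, \tag{6.23}$$

with cumulant functions $\theta_k \in \Theta_k \mapsto \kappa_k(\theta_k)$, exposure $v > 0$ and dispersion parameters $\varphi_k > 0$, for $1 \le k \le K$.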

At first sight, this does not look very spectacular and parameter estimation seems straightforward. If we consider the log-likelihood of $n$ independent random variables $Y = (Y_1, \ldots, Y_n)^\top$ following mixture density (6.23) we receive log-likelihood function

$$(\theta, p) \mapsto \ell_Y(\theta, p) = \sum_{i=1}^n \ell_{Y_i}(\theta, p) = \sum_{i=1}^n \log\left( \sum_{k=1}^K p_k f_k(Y_i; \theta_k, v/\varphi_k) \right), \tag{6.24}$$

for canonical parameter $\theta = (\theta_1, \ldots, \theta_K)^\top \in \Theta = \Theta_1 \times \cdots \times \Theta_K$ and mixture probability $p \in \Delta_K$. Unfortunately, MLE of $(\theta, p)$ in (6.24) is not that simple. Note, the summation over $1 \le k \le K$ is inside of the logarithmic function, and the use of the Newton–Raphson algorithm may be cumbersome. The Expectation-Maximization (EM) algorithm presented in Sect. 6.3.3, below, makes parameter estimation feasible. In a nutshell, the EM algorithm leads to a sequence of parameter estimates for $(\theta, p)$ that monotonically increases the log-likelihood in each iteration of the algorithm. Thus, we can receive an approximation to the MLE of $(\theta, p)$.

Nevertheless, model fitting may still be difficult for the following reasons. Firstly, 
the log-likelihood function of a mixture distribution does not need to be bounded, 
we highlight this in Example 6.13, below. In that case, MLE is not a well-defined 
problem. Secondly, even in very simple situations, the log-likelihood function (6.24) 
can have multiple local maximums. This usually happens if the data is clustered 
and the clusters are well separated. In that case of multiple local maximums, 
convergence of the EM algorithm does not guarantee that we have found the global 
maximum. Thirdly, convergence of the log-likelihood function through the EM algorithm does not guarantee that also the sequence of parameter estimates of $(\theta, p)$ converges. The latter needs additional examination and regularity conditions.

Figure 6.10 (lhs) shows a density of a mixture distribution mixing $K = 3$ gamma densities with shape parameters $\alpha_k = 1, 20, 40$ (orange, green and blue) and mixture probability $p = (0.7, 0.1, 0.2)^\top$; the mixture components are already multiplied with $p$. The resulting mixture density in red color is continuous. Figure 6.10 (rhs)
replaces the blue gamma component of the plot on the left-hand side by a Pareto 
component (in blue). As a result we observe that the resulting mixture density in 
red is no longer continuous. This example is often used in practice, however, the 
discontinuity may be a serious issue in applications and one may use a Lomax 
(Pareto Type II) component instead, we refer to Sect. 2.2.5. 
Fig. 6.10 (lhs) Mixture distribution mixing three gamma densities, and (rhs) mixture distributions mixing two gamma components and a Pareto component with mixture probabilities $p = (0.7, 0.1, 0.2)^\top$ for orange, green and blue components (the density components are already multiplied with $p$)

6.3.2 Incomplete and Complete Log-Likelihoods


A mixture distribution can be defined (brute force) by just defining a mixture 
density as in (6.23). Alternatively, we could define a mixture distribution in a more 
constructive way. In the following we discuss this constructive derivation which will 
allow us to efficiently fit mixture distributions to data Y. For our outline we focus 
on (6.23), but all results presented below hold true in much more generality. 

Choose a categorical random variable $Z$ with $K \ge 2$ levels having probabilities $\mathbb{P}[Z = k] = p_k > 0$ for $1 \le k \le K$, that is, with $p \in \Delta_K$. The main idea is to sample in a first step level $Z = k \in \{1, \ldots, K\}$, and in a second step $Y|_{\{Z = k\}} \sim f_k(y; \theta_k, v/\varphi_k)$, based on the selected level $Z = k$. The random tuple $(Y, Z)$ has joint density

$$(Y, Z) \sim f_{\theta, p}(y, k) = p_k f_k(y; \theta_k, v/\varphi_k),$$

and the marginal density of $Y$ is exactly given by (6.23). In this interpretation we have a hierarchical model $(Y, Z)$. If only $Y$ is available for parameter estimation, then we are in the situation of incomplete information because information about the first hierarchy $Z$ is missing. If both $Y$ and $Z$ are available we say that we have complete information.
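The two-step construction can be illustrated by simulation. The following is a minimal sketch, assuming gamma components with the shape parameters and mixture probabilities of Fig. 6.10 (lhs); the rate parameters equal to 1 and the function name `sample_mixture` are our own assumptions and not taken from the figure.

```python
import random


def sample_mixture(n, p, shapes, rates, seed=0):
    """Sample n tuples (Y, Z): first the level Z, then Y given Z = k."""
    rng = random.Random(seed)
    levels = list(range(1, len(p) + 1))
    draws = []
    for _ in range(n):
        # first step: categorical level Z with probabilities p
        k = rng.choices(levels, weights=p)[0]
        # second step: Y | Z = k is gamma distributed; gammavariate expects
        # the shape and the scale (= 1 / rate) as arguments
        y = rng.gammavariate(shapes[k - 1], 1.0 / rates[k - 1])
        draws.append((y, k))
    return draws


draws = sample_mixture(5000, p=[0.7, 0.1, 0.2],
                       shapes=[1.0, 20.0, 40.0], rates=[1.0, 1.0, 1.0])
complete_data = draws                       # (Y_i, Z_i): complete information
incomplete_data = [y for y, _ in draws]     # only Y_i: incomplete information
```

Dropping the second entry of each tuple gives exactly the incomplete information situation in which only $Y$ is observed.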

For the subsequent derivations we use a different coding of the categorical random variable $Z$, namely, $Z$ can be represented in the following one-hot encoding version

$$\boldsymbol{Z} = (Z_1, \ldots, Z_K)^\top = \left( \mathbb{1}_{\{Z = 1\}}, \ldots, \mathbb{1}_{\{Z = K\}} \right)^\top, \tag{6.25}$$

these are the $K$ corners of the $(K-1)$-unit simplex $\Delta_K$. One-hot encoding differs from dummy coding (5.21). One-hot encoding does not lead to a full rank design matrix because there is a redundancy, that is, we can drop one component of $\boldsymbol{Z}$ and still have the same information. One-hot encoding $\boldsymbol{Z}$ of $Z$ allows us to extend the incomplete (data) log-likelihood $\ell_Y(\theta, p)$, see (6.23)-(6.24), under complete information $(Y, \boldsymbol{Z})$ as follows

$$\ell_{(Y, \boldsymbol{Z})}(\theta, p) = \log\left( \prod_{k=1}^K \left( p_k f_k(Y; \theta_k, v/\varphi_k) \right)^{Z_k} \right)$$

$$= \log\left( \prod_{k=1}^K \left( p_k \exp\left\{ \frac{Y\theta_k - \kappa_k(\theta_k)}{\varphi_k / v} + a_k(Y; v/\varphi_k) \right\} \right)^{Z_k} \right) \tag{6.26}$$

$$= \sum_{k=1}^K Z_k \left( \log(p_k) + \frac{Y\theta_k - \kappa_k(\theta_k)}{\varphi_k / v} + a_k(Y; v/\varphi_k) \right).$$

$\ell_{(Y, \boldsymbol{Z})}(\theta, p)$ is called complete (data) log-likelihood. As a consequence of this last expression we observe that under complete information $(Y_i, \boldsymbol{Z}_i)_{1 \le i \le n}$, the MLE of $\theta$ and $p$ can be determined completely analogously to above. Namely, $\theta_k$ is estimated from all observations $Y_i$ for which $\boldsymbol{Z}_i$ belongs to level $k$, and the level indicators $(\boldsymbol{Z}_i)_{1 \le i \le n}$ are used to estimate the mixture probability $p$. Thus, the objective function nicely decouples under complete information into independent parts for $\theta_k$ and $p$ estimation. There remains the question of how to fit this model under incomplete information $Y$. The next section will discuss this problem.
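To illustrate how the estimation decouples under complete information, the following minimal sketch estimates the mixture probabilities and the component means from observed tuples $(Y_i, Z_i)$. It assumes the null model with unit exposures, in which case the MLE of the mean of an EDF component is the sample mean of the observations of that level; the function name is our own.

```python
def complete_data_mle(pairs, K):
    """MLE under complete information from tuples (y, k) with k in {1,...,K}."""
    n = len(pairs)
    counts = [0] * K
    sums = [0.0] * K
    for y, k in pairs:
        counts[k - 1] += 1
        sums[k - 1] += y
    # mixture probabilities: relative frequencies of the levels
    p_hat = [c / n for c in counts]
    # component means: sample means within each level (None if level is empty)
    mean_hat = [s / c if c > 0 else None for s, c in zip(sums, counts)]
    return p_hat, mean_hat


pairs = [(1.0, 1), (3.0, 1), (10.0, 2), (12.0, 2), (14.0, 2)]
p_hat, mean_hat = complete_data_mle(pairs, K=2)
```

The canonical parameter estimate then follows by applying the canonical link of the corresponding component to the estimated mean.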


6.3.3 Expectation-Maximization Algorithm for Mixtures 


The EM algorithm is a general purpose tool for parameter estimation under 
incomplete information. The EM algorithm has been introduced within the EF by Sundberg [348, 349]. Sundberg's developments have been based on the vector-valued parameter EF with statistics $S(Y) \in \mathbb{R}^k$, see (3.17), and he solved the estimation problem under the assumption that $S(Y)$ is not fully known. These results
have been generalized to MLE under incomplete data in the celebrated work of 
Dempster et al. [96] and Wu [385]. The monograph of McLachlan—Krishnan [267] 
gives the theory behind the EM algorithm, and it also provides a historical review 
in Section 1.8. In actuarial science the EM algorithm is increasingly used to solve 
various kinds of problems of incomplete data. Mixture models of Erlang distributions are considered in Lee–Lin [240], Yin–Lin [396] and Fung et al. [146, 147]; general Erlang mixtures are universal approximators to positive distributions (in the weak convergence sense), and regularized Erlang mixtures and mixtures of experts models are determined using the EM algorithm to receive approximations to the true underlying model. Miljkovic–Grün [278], Parodi [295] and Fung et al. [148] consider the EM algorithm for mixtures of general distributions, in particular, mixtures of small and large claims distributions. Verbelen et al. [371], Blostein–Miljkovic [40], Grün–Miljkovic [173] and Fung et al. [147] use the EM algorithm
for censored and/or truncated observations, and dispersion modeling is performed 
with the EM algorithm in Tzougas—Karlis [359]. (Inhomogeneous) phase-type and 
matrix Mittag—Leffler distributions are fitted with the EM algorithm in Asmussen 
et al. [14], Albrecher et al. [8] and Bladt [37], and the EM algorithm is used to 
fit mixture density networks (MDNs) in Delong et al. [95]. Parameter uncertainty is 
investigated in O'Hagan et al. [289] using the bootstrap method. The present section
is mainly based on McLachlan—Krishnan [267]. 

As mentioned above, the EM algorithm is a general purpose tool for parameter 
estimation under incomplete data, and we describe the variant of the EM algorithm 
which is useful for our mixture distribution setup given in (6.26). We give a 
justification for its functioning below. The EM algorithm is an iterative algorithm 
that performs a Bayesian expectation step (E-step) to infer the latent variable Z, 
given the model parameters and Y. Next, it performs a maximization step (M-step) 
for MLE of the parameters given the observation Y and the estimated latent variable 
Z. More specifically, the E-step and the M-step look as follows. 
• E-step. Calculate the posterior probability of the event that a given observation $Y$ has been generated from the $k$-th component of the mixture distribution. Bayes' rule allows us to infer this posterior probability (for given $\theta$ and $p$) from (6.26)

$$\mathbb{P}_{\theta, p}[Z_k = 1 \,|\, Y] = \frac{p_k f_k(Y; \theta_k, v/\varphi_k)}{\sum_{l=1}^K p_l f_l(Y; \theta_l, v/\varphi_l)}.$$

The posterior (Bayesian) estimate for $Z_k$ after having observed $Y$ is given by

$$\hat{Z}_k(\theta, p | Y) \overset{\rm def.}{=} \mathbb{E}_{\theta, p}[Z_k | Y] = \mathbb{P}_{\theta, p}[Z_k = 1 | Y] \qquad \text{for } 1 \le k \le K. \tag{6.27}$$

This posterior mean $\hat{\boldsymbol{Z}}(\theta, p | Y) = (\hat{Z}_1(\theta, p | Y), \ldots, \hat{Z}_K(\theta, p | Y))^\top \in \Delta_K$ is used as an estimate for the (unobserved) latent variable $\boldsymbol{Z}$; note that this posterior mean depends on the unknown parameters $(\theta, p)$.
• M-step. Based on $Y$ and $\hat{\boldsymbol{Z}}$ the parameters $\theta$ and $p$ are estimated with MLE.


Alternation of these two steps provides the following recursive algorithm. We assume to have independent responses $(Y_i, \boldsymbol{Z}_i)$, $1 \le i \le n$, following the mixture distribution (6.26), where, for simplicity, we assume that only the volumes $v_i > 0$ are dependent on $i$.


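EM algorithm for mixture distributions
(0) Choose an initial parameter $(\hat{\theta}^{(0)}, \hat{p}^{(0)}) \in \Theta \times \Delta_K$.
(1) Repeat for $t \ge 1$ until a stopping criterion is met:

• E-step. Given parameter $(\hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}) \in \Theta \times \Delta_K$ estimate the latent variables $\boldsymbol{Z}_i$, $1 \le i \le n$, by their conditional expectations, see (6.27),

$$\hat{\boldsymbol{Z}}_i^{(t)} = \hat{\boldsymbol{Z}}\left( \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)} \,\middle|\, Y_i \right) = \mathbb{E}_{\hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}}[\boldsymbol{Z}_i | Y_i] \in \Delta_K. \tag{6.28}$$

• M-step. Calculate the MLE $(\hat{\theta}^{(t)}, \hat{p}^{(t)}) \in \Theta \times \Delta_K$ based on (complete) observations $((Y_1, \hat{\boldsymbol{Z}}_1^{(t)}), \ldots, (Y_n, \hat{\boldsymbol{Z}}_n^{(t)}))$, i.e., solve the score equations, see (6.26),

$$\nabla_\theta \sum_{i=1}^n \sum_{k=1}^K \hat{Z}_{i,k}^{(t)}\, \frac{Y_i \theta_k - \kappa_k(\theta_k)}{\varphi_k / v_i} = 0, \tag{6.29}$$

$$\nabla_{p_-} \sum_{i=1}^n \sum_{k=1}^K \hat{Z}_{i,k}^{(t)} \log(p_k) = 0, \tag{6.30}$$

where $p_- = (p_1, \ldots, p_{K-1})^\top$ and setting $p_K = 1 - \sum_{k=1}^{K-1} p_k \in (0, 1)$.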


Remarks 6.11

• The E-step uses Bayes' rule. This motivates to consider the EM algorithm in this Bayesian chapter; alternatively, it also fits to the MLE chapters.

• We have formulated the M-step in (6.29)-(6.30) in a general way because the canonical parameter $\theta$ and the mixture probability $p$ could be modeled by GLMs, and, henceforth, they may be feature $x_i$ dependent. Moreover, (6.29) is formulated for a mixture of single-parameter EDF distributions, but, of course, this holds in much more generality.

• Equations (6.29)-(6.30) are the score equations received from (6.26). There is a subtle point here, namely, $Z_k \in \{0, 1\}$ in (6.26) are observations, whereas $\hat{Z}_k^{(t)} \in (0, 1)$ in (6.29)-(6.30) are their estimates. Thus, in the EM algorithm the unknown latent variables are replaced by their estimates which, in our setup, results in two different types of variables with disjoint ranges. This may matter in software implementations, for instance, a categorical GLM may ask for a categorical random variable $Z \in \{1, \ldots, K\}$ (of factor type), whereas $\hat{\boldsymbol{Z}}$ is in the interior of the unit simplex $\Delta_K$.

• For mixture distributions one can replace the latent variables $\boldsymbol{Z}_i$ by their conditionally expected values $\hat{\boldsymbol{Z}}_i^{(t)}$, see (6.29)-(6.30). In general, this does not hold true in EM algorithm applications: in our case we benefit from the fact that $Z_k$ influences the complete log-likelihood linearly, see (6.26). In the general (non-linear) case of the EM algorithm application, different from mixture distribution problems, one needs to calculate the conditional expectation of the log-likelihood function.

• If we calculate the scores element-wise we receive

$$\frac{\partial}{\partial \theta_k} \sum_{i=1}^n \hat{Z}_{i,k}^{(t)}\, \frac{Y_i \theta_k - \kappa_k(\theta_k)}{\varphi_k / v_i} = 0,$$

$$\frac{\partial}{\partial p_k} \sum_{i=1}^n \left( \hat{Z}_{i,k}^{(t)} \log(p_k) + \hat{Z}_{i,K}^{(t)} \log(p_K) \right) = 0,$$

recall normalization $p_K = 1 - \sum_{k=1}^{K-1} p_k \in (0, 1)$.

From the first score equation we see that we receive the classical MLE/GLM framework, and all tools introduced above for parameter estimation can directly be used. The only part that changes are the weights $v_i \mapsto v_i \hat{Z}_{i,k}^{(t)}$. In the homogeneous case, i.e., in the null model we have MLE after the $t$-th iteration of the EM algorithm

$$\hat{\theta}_k^{(t)} = h_k\left( \frac{\sum_{i=1}^n v_i \hat{Z}_{i,k}^{(t)} Y_i}{\sum_{i=1}^n v_i \hat{Z}_{i,k}^{(t)}} \right),$$

where $h_k$ is the canonical link that corresponds to cumulant function $\kappa_k$. If we choose the null model for the mixture probabilities we receive MLEs

$$\hat{p}_k^{(t)} = \frac{1}{n} \sum_{i=1}^n \hat{Z}_{i,k}^{(t)} \qquad \text{for } 1 \le k \le K. \tag{6.31}$$

In Sect. 6.3.4, below, we will present an example that uses the null model for the mixture probabilities $p$, and we present another example that uses a logistic categorical GLM for these mixture probabilities.
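The E-step and the M-step in the null model can be sketched in a few lines. The following illustration is our own and makes several simplifying assumptions: a mixture of $K = 2$ Poisson distributions (an EDF with $\varphi_k = 1$), unit exposures $v_i = 1$, toy data, and a fixed number of iterations instead of a stopping criterion. The M-step is formulated in the mean parametrization $\mu_k$; the canonical parameter is $\theta_k = h_k(\mu_k) = \log \mu_k$.

```python
import math


def log_poisson(y, mu):
    """Log-density of the Poisson distribution with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def e_step(ys, mus, ps):
    """Posterior probabilities (6.27) and the incomplete log-likelihood (6.24)."""
    resp, loglik = [], 0.0
    for y in ys:
        logs = [math.log(p) + log_poisson(y, mu) for p, mu in zip(ps, mus)]
        m = max(logs)  # log-sum-exp for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in logs))
        resp.append([math.exp(v - lse) for v in logs])
        loglik += lse
    return resp, loglik


def m_step(ys, resp):
    """Weighted means for the components and (6.31) for the probabilities."""
    n, K = len(ys), len(resp[0])
    mus, ps = [], []
    for k in range(K):
        weight = sum(r[k] for r in resp)
        mus.append(sum(r[k] * y for r, y in zip(resp, ys)) / weight)
        ps.append(weight / n)
    return mus, ps


def em_poisson_mixture(ys, mus, ps, iterations=50):
    history = []
    for _ in range(iterations):
        resp, loglik = e_step(ys, mus, ps)
        history.append(loglik)  # log-likelihood before the parameter update
        mus, ps = m_step(ys, resp)
    return mus, ps, history


ys = [0, 1, 1, 2, 0, 1, 2, 1, 9, 11, 10, 12, 8, 10]
mus, ps, history = em_poisson_mixture(ys, mus=[1.0, 5.0], ps=[0.5, 0.5])
```

The list `history` collects the incomplete log-likelihood over the iterations; it should be non-decreasing, which is a useful sanity check for any implementation of the EM algorithm.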


Justification of the EM Algorithm So far, we have neither given any argument why the EM algorithm is reasonable for parameter estimation nor have we said anything about convergence. The purpose of this paragraph is to justify the above EM algorithm. We aim at solving the incomplete log-likelihood maximization problem, see (6.24),

$$\left( \hat{\theta}^{\rm MLE}, \hat{p}^{\rm MLE} \right) = \underset{(\theta, p)}{\arg\max}\ \ell_Y(\theta, p) = \underset{(\theta, p)}{\arg\max} \sum_{i=1}^n \log\left( \sum_{k=1}^K p_k f_k(Y_i; \theta_k, v_i/\varphi_k) \right),$$

subject to existence and uniqueness. We introduce some notation. Let $f(y, z; \theta, p) = \exp\{\ell_{(y,z)}(\theta, p)\}$ be the joint density of $(Y, Z)$ and let $f(y; \theta, p) = \exp\{\ell_y(\theta, p)\}$ be the marginal density of $Y$. This allows us to rewrite the incomplete log-likelihood as follows for any value of $z$

$$\ell_Y(\theta, p) = \log f(Y; \theta, p) = \log\left( \frac{f(Y, z; \theta, p)}{f(z | Y; \theta, p)} \right),$$

thus, we bring in the complete log-likelihood by using Bayes' rule. Choose an arbitrary categorical distribution $\pi \in \Delta_K$ with $K$ levels. We have using the previous step

$$\ell_Y(\theta, p) = \log f(Y; \theta, p) = \sum_z \pi(z) \log f(Y; \theta, p)$$

$$= \sum_z \pi(z) \log\left( \frac{f(Y, z; \theta, p) / \pi(z)}{f(z | Y; \theta, p) / \pi(z)} \right)$$

$$= \sum_z \pi(z) \log\left( \frac{f(Y, z; \theta, p)}{\pi(z)} \right) + \sum_z \pi(z) \log\left( \frac{\pi(z)}{f(z | Y; \theta, p)} \right)$$

$$= \sum_z \pi(z) \log\left( \frac{f(Y, z; \theta, p)}{\pi(z)} \right) + D_{\rm KL}\left( \pi \,\|\, f(\cdot | Y; \theta, p) \right) \tag{6.32}$$

$$\ge \sum_z \pi(z) \log\left( \frac{f(Y, z; \theta, p)}{\pi(z)} \right);$$

the inequality follows because the KL divergence is always non-negative, see Lemma 2.21. This provides us with a lower bound for the incomplete log-likelihood $\ell_Y(\theta, p)$ for any categorical distribution $\pi \in \Delta_K$ and any $(\theta, p) \in \Theta \times \Delta_K$:

$$\ell_Y(\theta, p) \ge \sum_z \pi(z) \log\left( \frac{f(Y, z; \theta, p)}{\pi(z)} \right) = \mathbb{E}_{Z \sim \pi}\left[ \ell_{(Y,Z)}(\theta, p) \,\middle|\, Y \right] - \sum_z \pi(z) \log(\pi(z)) \overset{\rm def.}{=} Q(\theta, p; \pi). \tag{6.33}$$

Thus, we have a lower bound $Q(\theta, p; \pi)$ on the incomplete log-likelihood $\ell_Y(\theta, p)$. This lower bound is based on the conditionally expected complete log-likelihood $\ell_{(Y,Z)}(\theta, p)$, given $Y$, and under an arbitrary choice $\pi$ for $Z$. The difference between this arbitrary $\pi$ and the true conditional posterior distribution is given by the KL divergence $D_{\rm KL}(\pi \,\|\, f(\cdot | Y; \theta, p))$, see (6.32).
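The general idea of the EM algorithm is to make this lower bound $Q(\theta, p; \pi)$ as large as possible in $\theta$, $p$ and $\pi$ by iterating the following two alternating steps for $t \ge 1$:

$$\hat{\pi}^{(t)} = \underset{\pi}{\arg\max}\ Q\left( \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}; \pi \right), \tag{6.34}$$

$$\left( \hat{\theta}^{(t)}, \hat{p}^{(t)} \right) = \underset{\theta, p}{\arg\max}\ Q\left( \theta, p; \hat{\pi}^{(t)} \right). \tag{6.35}$$

The first step (6.34) can be solved explicitly and it results in the E-step. Namely, from (6.32) we see that maximizing $Q(\hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}; \pi)$ in $\pi$ is equivalent to minimizing the KL divergence $D_{\rm KL}(\pi \,\|\, f(\cdot | Y; \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}))$ in $\pi$ because the left-hand side of (6.32) is independent of $\pi$. Thus, we have to solve

$$\hat{\pi}^{(t)} = \underset{\pi}{\arg\max}\ Q\left( \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}; \pi \right) = \underset{\pi}{\arg\min}\ D_{\rm KL}\left( \pi \,\middle\|\, f(\cdot | Y; \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}) \right).$$

This optimization is solved by choosing the density $\hat{\pi}^{(t)} = f(\cdot | Y; \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)})$, see Lemma 2.21, and this gives us exactly (6.28) if we calculate the corresponding conditional expectation of the latent variable $Z$. Moreover, importantly, this step provides us with an identity in (6.33):

$$\ell_Y\left( \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)} \right) = Q\left( \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)}; \hat{\pi}^{(t)} \right). \tag{6.36}$$

The second step (6.35) then increases the right-hand side of (6.36). This second step is equivalent to

$$\left( \hat{\theta}^{(t)}, \hat{p}^{(t)} \right) = \underset{\theta, p}{\arg\max}\ Q\left( \theta, p; \hat{\pi}^{(t)} \right) = \underset{\theta, p}{\arg\max}\ \mathbb{E}_{Z \sim \hat{\pi}^{(t)}}\left[ \ell_{(Y,Z)}(\theta, p) \,\middle|\, Y \right], \tag{6.37}$$

and this maximization is solved by the solution of the score equations (6.29)-(6.30) of the M-step. In this step we explicitly use the linearity in $Z$ of the log-likelihood $\ell_{(Y,Z)}$, which allows us to calculate the objective function in (6.37) explicitly resulting in replacing $Z$ by $\hat{Z}^{(t)}$. For other incomplete data problems, where we do not have this linearity, this step will be more complicated.

Summarizing, alternating optimizations (6.34) and (6.35) gives us a sequence of parameters $(\hat{\theta}^{(t)}, \hat{p}^{(t)})_{t \ge 0}$ with monotonically increasing incomplete log-likelihoods

$$\ldots \le \ell_Y\left( \hat{\theta}^{(t-1)}, \hat{p}^{(t-1)} \right) \le \ell_Y\left( \hat{\theta}^{(t)}, \hat{p}^{(t)} \right) \le \ell_Y\left( \hat{\theta}^{(t+1)}, \hat{p}^{(t+1)} \right) \le \ldots \tag{6.38}$$

Therefore, the EM algorithm converges provided that the incomplete log-likelihood $\ell_Y(\theta, p)$ is a bounded function.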


Remarks 6.12

• In general, the log-likelihood function $(\theta, p) \mapsto \ell_Y(\theta, p)$ does not need to be bounded. In that case the EM algorithm may not converge (unless it converges to a local maximum). An illustrative example is given in Example 6.13, below, which shows what can go wrong in MLE of mixture distributions.

• Even if the log-likelihood function $(\theta, p) \mapsto \ell_Y(\theta, p)$ is bounded, one may not expect a unique solution to the parameter estimation problem with the EM algorithm. Firstly, a monotonically increasing sequence (6.38) only guarantees that we have convergence of that sequence. But the sequence may not converge to the global maximum and different starting points of the algorithm need to be explored. Secondly, convergence of sequence (6.38) does not necessarily imply that the parameters $(\hat{\theta}^{(t)}, \hat{p}^{(t)})$ converge for $t \to \infty$. On the one hand, we may have an identifiability issue because the components $f_k$ of the mixture distribution may be exchangeable, and on the other hand one needs stronger conditions to ensure that not only the log-likelihoods converge but also their arguments (parameters) $(\hat{\theta}^{(t)}, \hat{p}^{(t)})$. This is the point studied in Wu [385].

• Even in very simple examples of mixture distributions we can have multiple local maximums. In this case the starting point plays a crucial role. It is advantageous that in the starting configuration every component $k$ shares roughly the same number of observations for the initial estimates $(\hat{\theta}^{(0)}, \hat{p}^{(0)})$ and $\hat{\boldsymbol{Z}}^{(1)}$, otherwise one may start in a so-called spurious configuration where only a few observations almost fully determine a component $k$ of the mixture distribution. This may result in similar singularities as in Example 6.13, below. Therefore, there are three common ways to determine a starting configuration of the EM algorithm, see Miljkovic–Grün [278]: (a) Euclidean distance-based initialization: cluster centers are selected at random, and all observations are allocated to these centers according to the shortest Euclidean distance; (b) K-means clustering allocation; or (c) completely random allocation to $K$ bins. Using one of these three options, $f_k$ and $p$ are initialized.

• We have formulated the EM algorithm in the homogeneous situation. However, we can easily expand it to GLMs by, for instance, assuming that the canonical parameters $\theta_k$ are modeled by linear predictors $\langle \beta_k, x \rangle$ and/or likewise for the mixture probabilities $p$. The E-step will not change in this setup. For the M-step, we will solve a different maximization problem, however, this maximization problem respects monotonicity (6.38), and therefore a modified version of the above EM algorithm applies. We emphasize that the crucial point is monotonicity (6.38) that makes the EM algorithm a valid procedure.

6.3.4 Lab: Mixture Distribution Applications


In this section we are going to present different mixture distribution examples that 
use the EM algorithm for parameter estimation. On the one hand this illustrates the 
functioning of the EM algorithm, and on the other hand it also highlights pitfalls 
that need to be avoided. 


Example 6.13 (Gaussian Mixture) We directly fit a mixture model to the observation $Y = (Y_1, \ldots, Y_n)^\top$. Assume that the log-likelihood of $Y$ is given by a mixture of two Gaussian distributions

$$\ell_Y(\theta, \sigma, p) = \sum_{i=1}^n \log\left( \sum_{k=1}^2 p_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left\{ -\frac{1}{2\sigma_k^2} (Y_i - \theta_k)^2 \right\} \right),$$

with $p \in \Delta_2$, mean vector $\theta = (\theta_1, \theta_2)^\top \in \mathbb{R}^2$ and standard deviations $\sigma = (\sigma_1, \sigma_2)^\top \in \mathbb{R}_+^2$. Choose estimate $\hat{\theta}_1 = Y_1$, then we have

$$\lim_{\sigma_1 \to 0} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{1}{2\sigma_1^2} (Y_1 - \hat{\theta}_1)^2 \right\} = \lim_{\sigma_1 \to 0} \frac{1}{\sqrt{2\pi}\,\sigma_1} = \infty.$$

For any $i \neq 1$ we have $Y_i \neq \hat{\theta}_1$ (note that the Gaussian distribution is absolutely continuous and observations are distinct, a.s.). Henceforth for $i \neq 1$

$$\lim_{\sigma_1 \to 0} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{1}{2\sigma_1^2} (Y_i - \hat{\theta}_1)^2 \right\} = \lim_{\sigma_1 \to 0} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma_1^2} (Y_i - \hat{\theta}_1)^2 - \log \sigma_1 \right\} = 0.$$

If we choose any $\hat{\theta}_2 \in \mathbb{R}$, $p \in \Delta_2$ and $\sigma_2 > 0$, we receive for $\hat{\theta}_1 = Y_1$

$$\lim_{\sigma_1 \to 0} \ell_Y(\hat{\theta}, \sigma, p) = \lim_{\sigma_1 \to 0} \log\left( \sum_{k=1}^2 \frac{p_k}{\sqrt{2\pi}\,\sigma_k} \exp\left\{ -\frac{1}{2\sigma_k^2} (Y_1 - \hat{\theta}_k)^2 \right\} \right) + \sum_{i=2}^n \left( \log\left( \frac{p_2}{\sqrt{2\pi}\,\sigma_2} \right) - \frac{1}{2\sigma_2^2} (Y_i - \hat{\theta}_2)^2 \right) = \infty.$$

Thus, we can make the log-likelihood of this mixture Gaussian model arbitrarily large by fitting a degenerate Gaussian model to one observation in one mixture component, and letting the remaining observations be described by the other mixture component. This shows that the MLE problem may not be well-posed for mixture distributions because the log-likelihood can be unbounded.

If the data has well separated clusters, the log-likelihood of a mixture Gaussian distribution will have multiple local maximums. One can construct for any given number $B \in \mathbb{N}$ a data set $Y$ such that the number of local maximums exceeds this number $B$, see Theorem 3 in Améndola et al. [11]. ■
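The unboundedness of the log-likelihood in Example 6.13 is easily verified numerically. The following is a minimal sketch with hypothetical toy observations: we fix $\hat{\theta}_1 = Y_1$, $\hat{\theta}_2 = 2$, $\sigma_2 = 1$ and $p = (0.5, 0.5)^\top$, and we evaluate the log-likelihood for a decreasing sequence of $\sigma_1$; all names and numerical values are our own choices.

```python
import math


def gaussian_mixture_loglik(ys, thetas, sigmas, ps):
    """Incomplete log-likelihood of a Gaussian mixture (log-sum-exp version)."""
    total = 0.0
    for y in ys:
        logs = [math.log(p) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
                - 0.5 * ((y - t) / s) ** 2
                for p, t, s in zip(ps, thetas, sigmas)]
        m = max(logs)
        total += m + math.log(sum(math.exp(v - m) for v in logs))
    return total


ys = [0.3, 1.1, 1.9, 2.4, 3.2]           # hypothetical observations
sigma1_grid = [1e-1, 1e-2, 1e-4, 1e-8]   # sigma_1 decreasing towards 0
values = [gaussian_mixture_loglik(ys, [ys[0], 2.0], [s1, 1.0], [0.5, 0.5])
          for s1 in sigma1_grid]

if __name__ == "__main__":
    for s1, v in zip(sigma1_grid, values):
        print(s1, v)
```

The first mixture component degenerates to a point mass in $Y_1$ and the log-likelihood grows roughly like $-\log \sigma_1$, while all other observations are described by the second component.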


Example 6.14 (Gamma Claim Size Modeling) In this example we consider claim size modeling of the French MTPL example given in Chap. 13.1. In view of Fig. 13.15 this seems quite difficult because we have three modes and heavy-tailedness. We choose a mixture of 5 distribution functions, namely, four gamma distributions and the Lomax distribution

$$Y \sim \sum_{k=1}^4 p_k \frac{\beta_k^{\alpha_k}}{\Gamma(\alpha_k)}\, y^{\alpha_k - 1} \exp\{-\beta_k y\} + p_5\, \frac{\beta_5}{M} \left( \frac{y + M}{M} \right)^{-(\beta_5 + 1)}, \tag{6.39}$$

with shape parameters $\alpha_k$ and scale parameters $\beta_k$, $1 \le k \le 4$, for the gamma densities; scale parameter $M$ and tail parameter $\beta_5$ for the Lomax density; and with mixture probability $p \in \Delta_5$. The idea behind this choice is that three gamma distributions take care of the three modes of the empirical density, see Fig. 13.15, the fourth gamma distribution models the remaining claims in the body of the distribution, and the Lomax distribution takes care of the regularly varying tail of the data. For the gamma distribution, we refer to Sect. 2.1.3, and for the Lomax distribution, we refer to Sect. 2.2.5.

We choose the null model for both the mixture probabilities $p \in \Delta_5$ and the
densities $f_k$, $1 \le k \le 5$. This model can be fitted directly with the EM algorithm as
presented above; in particular, we can estimate the mixture probabilities by (6.31).
The remaining shape, scale and tail parameters are estimated directly by MLE. To
initialize the EM algorithm we use the interpretation of the components as explained
above. We partition the entire data into $K = 5$ bins according to their claim sizes
$Y_i$ being in (0, 300], (300, 1'000], (1'000, 1'200], (1'200, 5'000] or (5'000, ∞).
The first three intervals will initialize the three modes of the empirical density,
see Fig. 13.15 (lhs). This will correspond to the categorical variable taking values
$Z = 1, 2, 3$; the fourth interval will correspond to $Z = 4$ and it will model the main
body of the claims; and the last interval will correspond to $Z = 5$, modeling the
Lomax tail of the claims. These choices provide the initialization given in Table 6.1
with upper indices $^{(0)}$. We remark that we choose a fixed threshold of M = 2'000
for the Lomax distribution; this choice will be further discussed below.
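The binning step of this initialization can be sketched as follows. This is an illustration only: the toy claims are hypothetical, the function names are our own, and only the initial mixture probabilities (relative bin frequencies) are computed, not the initial gamma and Lomax parameters.

```python
from bisect import bisect_left

# upper end points of the first four bins; the fifth bin is (5'000, infinity)
BIN_EDGES = [300.0, 1000.0, 1200.0, 5000.0]

def initial_component(y):
    """Initial label Z in {1,...,5} of a claim y > 0; bins are right-closed."""
    return bisect_left(BIN_EDGES, y) + 1

def initial_probabilities(claims):
    """Relative bin frequencies, used as initial mixture probabilities."""
    counts = [0] * 5
    for y in claims:
        counts[initial_component(y) - 1] += 1
    return [c / len(claims) for c in counts]

# hypothetical toy claims, not the French MTPL data
claims = [120.0, 300.0, 750.0, 1128.0, 1200.0, 2400.0, 5000.0, 18000.0]
labels = [initial_component(y) for y in claims]
p0 = initial_probabilities(claims)
```

Using `bisect_left` makes the bins right-closed, so a claim of exactly 300 is allocated to the first bin, as in the intervals above.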

Based on these choices we run the EM algorithm for mixture distributions. We
observe convergence after roughly 80 iterations, and the resulting parameters after
100 iterations are presented in Table 6.1. We observe rather large shape parameters
$\widehat{\alpha}_k^{(100)}$ for the first three components $k = 1, 2, 3$. This indicates that these three
components model the three modes of the empirical density, and these three modes
collect almost $\widehat{p}_1^{(100)} + \widehat{p}_2^{(100)} + \widehat{p}_3^{(100)} \approx 50\%$ of all claims. The remaining claims
are modeled by the gamma density $k = 4$ having mean 1'304 and by the Lomax
distribution having tail parameter $\widehat{\beta}_5^{(100)} = 1.416$; thus, this tail has finite first
moment $M/(\widehat{\beta}_5^{(100)} - 1)$ = 4'812 and infinite second moment.




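Table 6.1 Parameter choices in the mixture model (6.39)

                                                                              | k=1   | k=2    | k=3      | k=4   | k=5
$p_k^{(0)}$                                                                   | 0.13  | 0.18   | 0.25     | 0.39  | 0.05
$\alpha_k^{(0)}$                                                              | 2.43  | 11.24  | 1'299.44 | 5.63  | -
$\beta_k^{(0)}$                                                               | 0.019 | 0.018  | 1.141    | 0.003 | 0.517
$\mu_k^{(0)} = \alpha_k^{(0)}/\beta_k^{(0)}$                                  | 125   | 623    | 1'138    | 1'763 | -
$\widehat{p}_k^{(100)}$                                                       | 0.04  | 0.03   | 0.42     | 0.25  | 0.26
$\widehat{\alpha}_k^{(100)}$                                                  | 93.05 | 650.94 | 1'040.37 | 1.34  | -
$\widehat{\beta}_k^{(100)}$                                                   | 1.207 | 1.108  | 0.888    | 0.001 | 1.416
$\widehat{\mu}_k^{(100)} = \widehat{\alpha}_k^{(100)}/\widehat{\beta}_k^{(100)}$ | 77    | 588    | 1'172    | 1'304 | -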


Figure 6.11 shows the resulting estimated mixture distribution. It gives the 
individual mixture components (top-lhs), the resulting mixture density (top-rhs), 
the QQ plot (bottom-lhs) and the log-log plot (bottom-rhs). Overall we find a 
rather good fit; maybe the first mode is a bit too spiky. However, this plot may 
also be misleading because the empirical density plot relies on kernel smoothing 
having a given bandwidth. Thus, the true observations may be more spiky than the 
plot indicates. The third mode suggests that there are two different values in the 
observations around 1’100, this is also visible in the QQ plot. Nevertheless, the 
overall result seems satisfactory. These results (based on 13 estimated parameters) 
are also summarized in Table 6.2. 

We mention a couple of limitations of these results. Firstly, the log-likelihood
of this mixture model is unbounded: similarly to Example 6.13, we can precisely fit
one degenerate gamma mixture component to an individual observation $Y_i$, which
results in an infinite log-likelihood value. Thus, the solution found corresponds
to a local maximum of the log-likelihood function, and we should not state AIC
values in Table 6.2, see also Remarks 4.28. Secondly, it is crucial to initialize three
components to the three modes. If we randomly allocate all claims to 5 bins as initial
configuration, the EM algorithm only finds mode $Z = 3$ but not necessarily the first
two modes; at least, this was the case in our specifically chosen random
initialization. In fact, the likelihood value of this latter solution was worse than in the first
calibration, which shows that we ended up in a worse local maximum.

We may be tempted to also estimate the Lomax threshold $M$ with MLE. In
Fig. 6.12 we plot the maximal log-likelihood as a function of $M$ (if we always start the EM
algorithm in the same configuration given in Table 6.1). From this figure a
threshold of M = 1'600 seems optimal. Choosing this threshold of M = 1'600
leads to a slightly bigger log-likelihood of −199'304 and a slightly smaller tail
parameter of $\widehat{\beta}_5 = 1.318$. However, overall the model is very similar to the one
with M = 2'000. In general, we do not recommend estimating $M$ with MLE; rather,
it should be treated as a hyper-parameter selected by the modeler. The reason for
this recommendation is that this threshold is crucial in deciding for large claims
modeling and its estimation from data is, typically, not very robust; we also refer to
Remarks 6.15, below.
Fig. 6.11 Mixture null model: (top-lhs) individual estimated gamma components
$f_k(\cdot; \widehat{\alpha}_k, \widehat{\beta}_k)$, $1 \le k \le 4$, and Lomax component $f_5(\cdot; \widehat{\beta}_5)$, (top-rhs) estimated
mixture density $\sum_{k=1}^{4} \widehat{p}_k f_k(\cdot; \widehat{\alpha}_k, \widehat{\beta}_k) + \widehat{p}_5 f_5(\cdot; \widehat{\beta}_5)$, (bottom-lhs) QQ plot of
the estimated model, (bottom-rhs) log-log plot of the estimated model


Table 6.2 Mixture models for French MTPL claim size modeling

                            | Log-likelihood | AIC     | Estimated mean
Empirical                   |                |         |
Null model (M = 2'000)      | −199'306       | 398'637 | 2'381
Logistic GLM (M = 2'000)    | −198'404       | 397'193 | 2'176


Fig. 6.12 Choice of Lomax threshold $M$ in the mixture model (x-axis: threshold $M$)

In a next step we enhance the mixture modeling by including feature information
$x_i$ to explain the responses $Y_i$. In view of Fig. 13.17 we have decided to only model
the mixture probabilities $p = p(x)$ feature dependent because feature information
seems to mainly influence the heights of the peaks. We do not consider features
VehPower and VehGas because these features do not seem to contribute, and
we do not consider Density because of the high co-linearity with Area, see
Fig. 13.12 (rhs). Thus, we are left with the features Area, VehAge, DrivAge,
BonusMalus, VehBrand and Region. Pre-processing of these features is done
as in Listing 5.1, except that we keep Area categorical. Using these features
$x \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ we choose a logistic categorical GLM for the mixture
probabilities


$$x \mapsto \left(p_1(x), \ldots, p_{K-1}(x)\right)^\top = \frac{\exp\{X\gamma\}}{1 + \sum_{j=1}^{K-1} \exp\langle \gamma_j, x\rangle}, \tag{6.40}$$

that is, we choose $K = 5$ as reference level, feature matrix $X \in \mathbb{R}^{(K-1)\times(K-1)(q+1)}$
is defined in (5.71), and with regression parameter $\gamma = (\gamma_1^\top, \ldots, \gamma_{K-1}^\top)^\top \in \mathbb{R}^{(K-1)(q+1)}$;
this regression parameter $\gamma$ should not be confused with the scale
parameters $\beta_1, \ldots, \beta_4$ of the gamma components and the tail parameter $\beta_5$ of the
Lomax component, see (6.39). Note that the notation in this section slightly differs
from Sect. 5.7 on the logistic categorical GLM. In this section we consider mixture
probabilities $p(x) \in \Delta_K$ with $K = 5$ (which corresponds to one-hot encoding), whereas
in Sect. 5.7 we model $(p_1(x), \ldots, p_{K-1}(x))^\top$ with a categorical GLM (which
corresponds to dummy coding), and normalization provides us with
$p_K(x) = 1 - \sum_{k=1}^{K-1} p_k(x) \in (0, 1)$.
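The map (6.40) with reference level $K$ can be sketched in a few lines of Python. The feature vector and the regression parameters below are hypothetical values chosen for illustration, and the function name is our own; the sketch returns all $K$ probabilities, i.e., the first $K-1$ components of (6.40) completed by the normalization $p_K(x)$.

```python
import math

def mixture_probabilities(x, gammas):
    """Return [p_1(x), ..., p_K(x)] for the logistic categorical GLM (6.40).

    x      : feature vector including the intercept component, length q + 1
    gammas : list of K - 1 parameter vectors gamma_j, each of length q + 1
    Level K is the reference level (its linear predictor is fixed to 0).
    """
    scores = [sum(g_l * x_l for g_l, x_l in zip(g, x)) for g in gammas]
    # subtract the largest score (0 belongs to the reference level) for stability
    shift = max(0.0, max(scores))
    exps = [math.exp(s - shift) for s in scores]
    ref = math.exp(-shift)
    denom = ref + sum(exps)
    return [e / denom for e in exps] + [ref / denom]

# hypothetical example with q = 2 features and K = 5 mixture components
x = [1.0, 0.5, -1.2]
gammas = [[0.1, 0.3, 0.0],
          [-0.4, 0.2, 0.1],
          [1.0, -0.5, 0.3],
          [0.0, 0.0, 0.0]]
probs = mixture_probabilities(x, gammas)
```

In this toy example the fourth parameter vector is zero, so component 4 receives the same probability as the reference level 5.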

This logistic categorical GLM requires that we replace in the M-step
the probability estimation (6.31) by Fisher's scoring method for GLMs as
outlined in Sect. 5.7.2, but there is a small difference to that section. In the
working residuals (5.74) we use dummy coding $T(Z) \in \{0, 1\}^{K-1}$ of a
categorical variable $Z$; this now needs to be replaced by the estimated vector
$(\widehat{Z}_1(\widehat{\theta}, \widehat{p}|Y), \ldots, \widehat{Z}_{K-1}(\widehat{\theta}, \widehat{p}|Y))^\top \in (0, 1)^{K-1}$, which is used as an estimate
for the latent variable $T(Z)$. Apart from that everything is done as described in
Sect. 5.7.2; in R this can be done with the procedure multinom from the package
nnet [368]. We start the EM algorithm exactly in the final configuration of the
estimated mixture null model, and we run this algorithm for 20 iterations (which
provides convergence).

Table 6.3 Parameter choices in the mixture models: upper part null model, lower part GLM for
estimated mixture probabilities $\widehat{p}(x_i)$

                                                           | k=1   | k=2    | k=3      | k=4   | k=5
Null: $\widehat{p}_k$                                      | 0.04  | 0.03   | 0.42     | 0.25  | 0.26
Null: $\widehat{\alpha}_k$                                 | 93.05 | 650.94 | 1'040.37 | 1.34  | -
Null: $\widehat{\beta}_k$                                  | 1.207 | 1.108  | 0.888    | 0.001 | 1.416
Null: $\widehat{\mu}_k = \widehat{\alpha}_k/\widehat{\beta}_k$ | 77    | 588    | 1'172    | 1'304 | -
GLM: average mixture probabilities                         | 0.04  | 0.03   | 0.42     | 0.25  | 0.26
GLM: $\widehat{\alpha}_k$                                  | 94.03 | 597.20 | 1'043.38 | 1.28  | -
GLM: $\widehat{\beta}_k$                                   | 1.223 | 1.019  | 0.891    | 0.001 | 1.365
GLM: $\widehat{\mu}_k = \widehat{\alpha}_k/\widehat{\beta}_k$  | 77    | 586    | 1'172    | 1'268 | -

The resulting parameters are given in the lower part of Table 6.3. We observe that 
the resulting parameters remain essentially the same, the second mode Z = 2 is a 
bit less spiky, and the tail parameter is slightly smaller. The summary of this model 
is given on the last line of Table 6.2. Regression modeling adds another $4 \cdot 45 = 180$
parameters to the model because we have q = 45 feature components in x (different 
from the intercept component). In view of AIC we give preference to the logistic 
mixture probability case (though AIC has to be interpreted with care, here, because 
we do not consider the MLE but rather a local maximum). 
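The AIC comparison can be reproduced with a short calculation. The sketch below uses the rounded log-likelihood values of Table 6.2 together with the parameter counts stated in the text (13 for the null model, 13 + 180 for the logistic GLM); because the log-likelihoods are rounded, the results may differ from the tabulated AIC values in the last digit.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

aic_null = aic(-199_306, 13)        # null model with 13 estimated parameters
aic_glm = aic(-198_404, 13 + 180)   # logistic GLM adds 4 * 45 = 180 parameters
prefer_glm = aic_glm < aic_null     # smaller AIC is preferred
```

The gain in log-likelihood of roughly 900 clearly outweighs the penalty for the 180 additional parameters, subject to the caveat above that both fits are local maxima.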

Figure 6.13 plots the individual estimated mixture probabilities $x_i \mapsto \widehat{p}(x_i) \in \Delta_5$
over the insurance policies $1 \le i \le n$; these plots are inspired by the thesis of
Frei [138]. The upper plots consider these probabilities against the estimated claim
sizes $\widehat{\mu}(x_i) = \sum_{k=1}^{5} \widehat{p}_k(x_i) \widehat{\mu}_k$ and the lower plots against the ranks of $\widehat{\mu}(x_i)$; the
latter gives a different scaling on the x-axis because of the heavy-tailedness of the
claims. The plots on the left-hand side show all individual policies $1 \le i \le n$, and
the plots on the right-hand side show a quadratic spline fit to these observations. Not
surprisingly, we observe that the claim size estimate $\widehat{\mu}(x_i)$ is mainly driven by the
large claims probability $\widehat{p}_5(x_i)$ describing the Lomax contribution.

In Fig. 6.14 we compare the QQ plots of the mixture null model and the one
where we model the mixture probabilities with the logistic categorical GLM. We
see that the latter (more complex) model clearly outperforms the simpler one;
in fact, this QQ plot looks quite convincing for the French MTPL claim size data.
Finally, we perform a Wald test (5.32). We simultaneously treat all parameters that
belong to the same feature variable (similar to the ANOVA analysis); for instance,
for the 22 Regions the corresponding part of the regression parameter $\gamma$ contains
$4 \cdot 21 = 84$ components. The resulting p-values of dropping such components are
all close to 0, which says that we should not eliminate any of the feature variables.
This closes the example. ∎
Fig. 6.13 Mixture probabilities $x_i \mapsto \widehat{p}(x_i)$ on individual policies $1 \le i \le n$: (top) against the
estimated means $\widehat{\mu}(x_i)$ and (bottom) against the ranks of the estimated means $\widehat{\mu}(x_i)$; (lhs) over
policies $1 \le i \le n$ and (rhs) quadratic spline fit


Remarks 6.15 


• In Example 6.14 we have chosen a mixture distribution with four gamma
components and one Lomax component. The reason for choosing the Lomax
component has been two-fold. Firstly, we need a regularly varying tail to
model the heavy-tailed property of the data. Secondly, we have preferred the
Lomax distribution over the Pareto distribution because this provides us with a
continuous density in (6.39). The results in Example 6.14 have been satisfactory.
In most practical applications, however, this approach will not work, even when
fixing the threshold $M$ of the Lomax component. Often, the nature of the data
is such that the chosen gamma mixture distribution is not able to fully explain
the small data in the body of the distribution, and in that situation the Lomax tail
will assist in fitting the small claims. The typical result is that the Lomax part
then pays more attention to small claims (through the log-likelihood function of
numerous small claims) and the fitting of the tail turns out to be poor (because
a few large claims do not sufficiently contribute to the log-likelihood). There are
two ways to solve this dilemma. Either one works with composite distributions,
see (6.56) below, and one drops the continuity property of the density; this is the
approach taken in Fung et al. [148]. Or one fits the Lomax distribution solely
to large observations in a first step, and then fixes the parameters of the Lomax
distribution during the second step when fitting the full model to all data; this
is the approach taken in Frei [138]. Both of these approaches have been
providing good results on real insurance data.

Fig. 6.14 QQ plots of the mixture models: (lhs) null model and (rhs) logistic categorical GLM for
mixture probabilities

• There is an asymptotic theory for the optimal selection of the number of
mixture components, we refer to Khalili-Chen [214] and Khalili [213]. Fung et
al. [148] combine this asymptotic theory of mixture component selection with
feature selection within these mixture components using LASSO and SCAD
regularization.

• In Example 6.14 we have only modeled the mixture probabilities feature dependent,
but not the parameters of the gamma mixture components. Introducing
regressions for the gamma mixture components needs some care in fitting. For
policy independent shape parameters $\alpha_1, \ldots, \alpha_4$, we can estimate the regression
functions for the means of the mixture components without explicitly specifying
$\alpha_k$ because these shape parameters cancel in the score equations. However, these
shape parameters will be needed in the E-step, which also requires MLE of $\alpha_k$.
For more discussion on shape parameter estimation we refer to Sect. 5.3.7 (GLM
with constant shape parameter) and Sect. 5.5.4 (double GLM).
6.4 Truncated and Censored Data


6.4.1 Lower-Truncation and Right-Censoring 


A common problem in insurance is that we often have truncated or censored
observations. Truncation naturally occurs if we sell insurance products that have
a deductible $d > 0$ because in that case only the insurance claim $(Y - d)_+$ is
compensated, and claims below the deductible $d$ are usually not reported to the
insurance company. This case is called lower-truncation, because claims below the
deductible are not observed. If we lower-truncate an original claim $Y \sim f(\cdot; \theta)$ with
lower-truncation point $\tau \in \mathbb{R}$ we obtain the density

$$f_{(\tau,\infty)}(y; \theta) = \frac{f(y; \theta)\, \mathbb{1}_{\{y > \tau\}}}{1 - F(\tau; \theta)}, \tag{6.41}$$

if $F(\cdot; \theta)$ is the distribution function corresponding to the density $f(\cdot; \theta)$. The
lower-truncated density $f_{(\tau,\infty)}(y; \theta)$ only considers claims that fall into the interval
$(\tau, \infty)$. Obviously, we can define upper-truncation completely analogously by
considering an interval $(-\infty, \tau]$ instead. Figure 6.15 (lhs) gives an example of a
lower-truncated density, and Fig. 6.15 (rhs) gives an example of a lower- and
upper-truncated density.
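The construction (6.41) can be sketched generically in Python. For a self-contained illustration we use an exponential claim size distribution with a hypothetical rate, because its density and distribution function are available in closed form; the helper name is our own.

```python
import math

def lower_truncated_pdf(pdf, cdf, tau):
    """Density (6.41) of the lower-truncated claim, built from pdf and cdf."""
    survival = 1.0 - cdf(tau)              # probability of exceeding tau
    def truncated(y):
        return pdf(y) / survival if y > tau else 0.0
    return truncated

# hypothetical exponential claim size model with mean 1'000
rate = 0.001
pdf = lambda y: rate * math.exp(-rate * y) if y >= 0.0 else 0.0
cdf = lambda y: 1.0 - math.exp(-rate * y) if y >= 0.0 else 0.0

f_trunc = lower_truncated_pdf(pdf, cdf, tau=2000.0)

# midpoint-rule check that the truncated density integrates to (almost) one
step = 10.0
total_mass = sum(f_trunc(2000.0 + (j + 0.5) * step) * step for j in range(3000))
```

For the exponential distribution the truncated density is again exponential, shifted to start at $\tau$, which gives a simple check of the sketch.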

Censoring occurs when selling insurance products with a maximal cover $M > 0$
because in that case only the insurance claim $Y \wedge M = \min\{Y, M\}$ is compensated,
and the exact claim size above the maximal cover $M$ may not be available. This case
is called right-censoring because the exact claim amount above $M$ is not known.
Right-censoring of an original claim $Y \sim F(\cdot; \theta)$ with censoring point $M \in \mathbb{R}$
gives the distribution

$$F_{Y \wedge M}(y; \theta) = F(y; \theta)\, \mathbb{1}_{\{y < M\}} + \mathbb{1}_{\{y \ge M\}},$$

that is, we have a point mass in the censoring point $M$. We can define left-censoring
analogously by considering the claim $Y \vee M = \max\{Y, M\}$. Figure 6.16 (lhs) shows
a right-censored gamma distribution with censoring point M = 6'000, and Fig. 6.16
(rhs) shows a left- and right-censored example with censoring points 2'000 and
6'000.

Fig. 6.15 (lhs) Lower-truncated gamma density with τ = 2'000, and (rhs) lower- and upper-truncated
gamma density with truncation points 2'000 and 6'000

Fig. 6.16 (lhs) Right-censored gamma distribution with M = 6'000, and (rhs) left- and right-censored
gamma distribution with censoring points 2'000 and 6'000

Often in re-insurance, deductibles (also called retention levels) and maximal
covers are combined, for instance, an excess-of-loss (XL) insurance cover of size
$u > 0$ above the retention level $d > 0$ covers the claim

$$(Y - d)_+ \wedge u = (Y - d)\, \mathbb{1}_{\{d < Y \le d + u\}} + u\, \mathbb{1}_{\{Y > d + u\}} = (Y - d)_+ - (Y - (d + u))_+.$$
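The two representations of the XL cover can be checked numerically. The following sketch uses hypothetical values for the retention level and the cover size; the function names are our own.

```python
def xl_cover(y, d, u):
    """XL payment: the claim in excess of the retention d, capped at the cover u."""
    return min(max(y - d, 0.0), u)

def xl_cover_as_difference(y, d, u):
    """Same payment written as the difference of two excess claims."""
    return max(y - d, 0.0) - max(y - (d + u), 0.0)

d, u = 2000.0, 4000.0   # hypothetical retention level and cover size
claims = [500.0, 2000.0, 3500.0, 6000.0, 9000.0]
payments = [xl_cover(y, d, u) for y in claims]
```

Claims below the retention level receive no payment, and claims above $d + u$ receive the full cover $u$.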


Obviously, truncation and censoring pose some challenges in regression modeling
because we need to consider the density $f(\cdot; \theta)$ and the distribution
function $F(\cdot; \theta)$ at the same time to estimate a parameter $\theta$. Both cases can be understood as
missing data problems, with censoring providing the number of claims but not
necessarily the exact claim size, and with truncation leaving also the number of
claims unknown. These two cases are studied in Fung et al. [147] within the mixture
of experts models using a variant of the EM algorithm. We use their techniques
within the EDF framework for right-censored or lower-truncated data. This is done
in the next sections.
6.4.2 Parameter Estimation Under Right-Censoring


Assume we have a fixed censoring point $M > 0$ that applies to independent
observations $Y_i$ following EDF densities $f(\cdot; \theta_i, v_i/\varphi)$; for simplicity we assume
to work with an absolutely continuous EDF in this section. The (incomplete)
log-likelihood function of canonical parameters $\theta = (\theta_i)_{1 \le i \le n}$ for observations $Y \wedge M$
is given by

$$\ell_{Y \wedge M}(\theta) = \sum_{i:\, Y_i < M} \log f(Y_i; \theta_i, v_i/\varphi) + \sum_{i:\, Y_i \wedge M = M} \log\left(1 - F(M; \theta_i, v_i/\varphi)\right). \tag{6.42}$$

We interpret this as an incomplete data problem because the claim sizes $Y_i$ above
the censoring point $M$ are not known. The complete log-likelihood is given by

$$\ell_Y(\theta) = \sum_{i=1}^{n} \log f(Y_i; \theta_i, v_i/\varphi).$$


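Similarly to (6.32) we calculate a lower bound to the incomplete log-likelihood.
We focus on one component of $Y$ and drop the lower index $i$ in $Y_i$ for this
consideration. Firstly, if $Y \wedge M < M$ we are in the situation of full claim size
information and, obviously, in that case $Y < M$ we have log-likelihood

$$\ell_{Y \wedge M}(\theta) = \ell_Y(\theta) = \frac{Y\theta - \kappa(\theta)}{\varphi/v} + a(Y; v/\varphi). \tag{6.43}$$

In the second case $Y \wedge M = M$ we do not have precise claim size information. In
that case we have conditional density of claim $Y|_{\{Y \wedge M = M\}} = Y|_{\{Y \ge M\}}$ above $M$

$$f(y \,|\, Y \ge M; \theta, v/\varphi) = \frac{f(y; \theta, v/\varphi)\, \mathbb{1}_{\{y \ge M\}}}{1 - F(M; \theta, v/\varphi)} = \frac{f(y; \theta, v/\varphi)\, \mathbb{1}_{\{y \ge M\}}}{\exp\{\ell_{Y \wedge M = M}(\theta)\}}; \tag{6.44}$$

the latter follows because $Y \wedge M = M$ has the corresponding point mass in censoring
point $M$ (we work with an absolutely continuous EDF here). Choose an arbitrary
density $\pi$ having the same support as $Y|_{\{Y \ge M\}}$, and consider a random variable
$Z \sim \pi$. Using (6.44) and the EDF structure on the last line, we have for $Y \ge M$

$$\begin{aligned}
\ell_{Y \wedge M}(\theta) &= \int \pi(z)\, \ell_{Y \wedge M = M}(\theta)\, d\nu(z) \\
&= \int \pi(z) \log\left(\frac{f(z; \theta, v/\varphi)/\pi(z)}{f(z \,|\, Y \ge M; \theta, v/\varphi)/\pi(z)}\right) d\nu(z) \\
&= \int \pi(z) \log\left(\frac{f(z; \theta, v/\varphi)}{\pi(z)}\right) d\nu(z) + D_{\rm KL}\left(\pi \,\|\, f(\cdot \,|\, Y \ge M; \theta, v/\varphi)\right) \\
&\ge \int \pi(z) \log\left(\frac{f(z; \theta, v/\varphi)}{\pi(z)}\right) d\nu(z) \\
&= \frac{\mathbb{E}_\pi[Z]\, \theta - \kappa(\theta)}{\varphi/v} + \mathbb{E}_\pi\left[a(Z; v/\varphi)\right] - \mathbb{E}_\pi\left[\log \pi(Z)\right] \;\overset{\rm def.}{=}\; Q(\theta; \pi).
\end{aligned}$$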


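This allows us to explore the E-step and the M-step similarly to (6.34) and (6.35).
The E-step in the case $Y \ge M$ for given canonical parameter estimate $\widehat{\theta}^{(t-1)}$
reads as

$$\widehat{\pi}^{(t)} = \underset{\pi}{\arg\max}\; Q\left(\widehat{\theta}^{(t-1)}; \pi\right) = \underset{\pi}{\arg\min}\; D_{\rm KL}\left(\pi \,\|\, f(\cdot \,|\, Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi)\right) = f(\cdot \,|\, Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi).$$

This allows us to calculate the estimation of the claim size above $M$, i.e., under $\widehat{\pi}^{(t)}$

$$\widehat{Y}^{(t)} = \mathbb{E}_{\widehat{\pi}^{(t)}}[Z] = \int z\, f(z \,|\, Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi)\, d\nu(z). \tag{6.45}$$

Note that this is an estimate of the censored claim $Y|_{\{Y \ge M\}}$. This completes the
E-step.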
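The M-step considers in the EDF case for censored claim sizes $Y \ge M$

$$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\; Q\left(\theta; \widehat{\pi}^{(t)}\right) = \underset{\theta}{\arg\max}\; \frac{\mathbb{E}_{\widehat{\pi}^{(t)}}[Z]\, \theta - \kappa(\theta)}{\varphi/v} = \underset{\theta}{\arg\max}\; \ell_{\widehat{Y}^{(t)}}(\theta), \tag{6.46}$$

the latter uses that the normalizing term $a(\cdot; v/\varphi)$ is not relevant for the MLE of
$\theta$. That is, (6.46) describes the regular MLE step under the observation $\widehat{Y}^{(t)}$ in the
case of a censored observation $Y \ge M$; and if $Y < M$ we simply use the
log-likelihood (6.43).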


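EM algorithm for right-censored data within the EDF

(0) Choose an initial parameter $\widehat{\theta}^{(0)} = (\widehat{\theta}_i^{(0)})_{1 \le i \le n}$.
(1) Repeat for $t \ge 1$:

• E-step. Given parameter $\widehat{\theta}^{(t-1)} = (\widehat{\theta}_i^{(t-1)})_{1 \le i \le n}$, estimate for the
right-censored claims $Y_i \ge M$ their sizes by, see (6.45),

$$\widehat{Y}_i^{(t)} = \int z\, f(z \,|\, Y_i \ge M; \widehat{\theta}_i^{(t-1)}, v_i/\varphi)\, d\nu(z).$$

This provides us with an estimated observation

$$\widehat{Y}^{(t)} = \left(Y_i\, \mathbb{1}_{\{Y_i < M\}} + \widehat{Y}_i^{(t)}\, \mathbb{1}_{\{Y_i \ge M\}}\right)^\top_{1 \le i \le n}.$$

• M-step. Calculate the MLE $\widehat{\theta}^{(t)} = (\widehat{\theta}_i^{(t)})_{1 \le i \le n}$ based on observation $\widehat{Y}^{(t)}$,
i.e., solve

$$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\; \ell_{\widehat{Y}^{(t)}}(\theta).$$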

Note that the above EM algorithm uses that the log-likelihood $\ell_Y(\theta)$ of the EDF
is linear in the observations that interact with parameter $\theta$. We revisit the gamma
claim size example of Sect. 5.3.7.


Example 6.16 (Right-Censored Gamma Claim Sizes) We revisit the gamma claim
size GLM introduced in Sect. 5.3.7. The claim sizes are illustrated in Fig. 13.22. In
total we have $n = 656$ observations $Y_i$, and they range from 16 SEK to 211'254
SEK. We right-censor this data at M = 50'000; this results in 545 uncensored
observations and 111 censored observations equal to $M$. Thus, for the 17% largest
claims we assume to not have any knowledge about the exact claim sizes. We use
the EM algorithm for right-censored data to fit a GLM to this problem.

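In order to calculate the E-step we need to evaluate the conditional
expectation (6.45) under the gamma model

$$\widehat{Y}^{(t)} = \int_M^\infty z\, f(z \,|\, Y \ge M; \widehat{\theta}^{(t-1)}, v/\varphi)\, dz = \int_M^\infty z\, \frac{\beta^\alpha}{\Gamma(\alpha)}\, \frac{z^{\alpha-1} \exp\{-\beta z\}}{1 - G(\alpha, \beta M)}\, dz = \frac{\alpha}{\beta}\, \frac{1 - G(\alpha + 1, \beta M)}{1 - G(\alpha, \beta M)}, \tag{6.47}$$

with shape parameter $\alpha = v/\varphi$, scale parameter $\beta = -\widehat{\theta}^{(t-1)} v/\varphi$, see (5.45), and
scaled incomplete gamma function

$$G(\alpha, y) = \frac{1}{\Gamma(\alpha)} \int_0^y z^{\alpha-1} \exp\{-z\}\, dz \in (0, 1) \qquad \text{for } y \in (0, \infty). \tag{6.48}$$

Thus, we receive a simple formula that allows us to efficiently calculate the
E-step, and the M-step is exactly the gamma GLM explained in Sect. 5.3.7 for the
(estimated) data $\widehat{Y}^{(t)}$.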
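The following Python sketch illustrates the E-step formula (6.47) and the resulting EM iteration in the simplest possible setting. It is not the GLM of this example: we assume a gamma null model (no features) with a known shape parameter, so that the M-step reduces to the sample mean of the completed data, and we use simulated claims with hypothetical parameters. The scaled incomplete gamma function is evaluated by its power series; all function names are our own.

```python
import math
import random

def scaled_incomplete_gamma(a, x, tol=1e-13, max_terms=100_000):
    """G(a, x) of (6.48), via the power series of the lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > tol * total and n < max_terms:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def censored_claim_mean(M, alpha, beta):
    """Expected gamma claim given that it exceeds M, see (6.47)."""
    upper = 1.0 - scaled_incomplete_gamma(alpha + 1.0, beta * M)
    lower = 1.0 - scaled_incomplete_gamma(alpha, beta * M)
    return alpha / beta * upper / lower

def em_right_censored_mean(observed, M, alpha, n_iter=100):
    """EM estimate of the mean in a gamma null model with fixed shape alpha."""
    mu = sum(observed) / len(observed)       # crude start: censored values taken as exact
    for _ in range(n_iter):
        y_hat = censored_claim_mean(M, alpha, alpha / mu)        # E-step
        completed = [y_hat if y >= M else y for y in observed]
        mu = sum(completed) / len(completed)                     # M-step
    return mu

# simulated data with hypothetical parameters (not the claims of this example)
random.seed(2023)
alpha, true_mean, M = 1.5, 25_000.0, 50_000.0
claims = [random.gammavariate(alpha, true_mean / alpha) for _ in range(5_000)]
observed = [min(y, M) for y in claims]       # right-censoring at M

mu_crude = sum(observed) / len(observed)     # ignores the censoring
mu_em = em_right_censored_mean(observed, M, alpha)
```

Since the estimated claim size $\widehat{Y}^{(t)}$ always exceeds $M$, the EM estimate necessarily lies above the crude estimate; for shape parameter 1 (exponential case) the E-step formula reduces to $M + 1/\beta$, which can serve as a check of the implementation.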

For the modeling we choose exactly the features as used for model Gamma
GLM2; this gives $q + 1 = 7$ regression parameter components, and additionally we
set for the dispersion parameter $\widehat{\varphi}^{\rm MLE} = 1.427$, which is the MLE in model Gamma
GLM2. This dispersion parameter we keep fixed in all our models studied in this
example. In a first step we simply fit a gamma GLM to the right-censored data
$Y_i \wedge M$. We call this model 'crude GLM2', and it underestimates the empirical
claim sizes by 28% because it ignores the fact of having right-censored data.

Table 6.4 Comparison of the complete log-likelihood and the incomplete log-likelihood
(right-censoring M = 50'000) results

                                | # Param. | Log-likelihood $\ell_Y(\widehat{\mu}^{\rm MLE}, \widehat{\varphi}^{\rm MLE})$ | Dispersion est. $\widehat{\varphi}^{\rm MLE}$ | Average amount | Rel. change
Gamma GLM2 (complete data)      | 7+1      | −7'129 | 1.427 | 25'130 |
Crude GLM2 (right-censored)     | 7+1      | −7'158 |       | 18'068 | −28%
EM est. GLM2 (right-censored)   | 7+1      | −7'132 |       | 26'687 | +6%

To initialize the EM algorithm for right-censored data we use the model crude
GLM2. We then iterate the algorithm for 15 steps, which provides convergence. The
results are presented in Table 6.4. We observe that the resulting log-likelihood of
the model fitted on the censored data and evaluated on the complete data $\ell_Y$ (which
is available here) is almost the same as for model Gamma GLM2, which has been
estimated on the complete data. Moreover, this right-censored EM algorithm fitted
model slightly over-estimates the average claim sizes.

Figure 6.17 shows the estimated means $\widehat{\mu}_i$ on an individual claims level. The
x-axis always gives the estimates from the complete log-likelihood model Gamma
GLM2. The y-axis on the left-hand side shows the estimates from the crude GLM
and the right-hand side the estimates from the EM algorithm fitted counterpart (fitted
on the right-censored data). We observe that the crude model underestimates the
claims (being below the diagonal), and the largest estimate lies below M = 50'000
in our example (horizontal dotted line). The EM algorithm fitted model, considering
the fact that we have right-censored data, corrects for the censoring, and the resulting
estimates resemble the ones from the complete log-likelihood model quite well.
In fact, we probably slightly over-estimate under right-censoring, here. Note that
all these considerations have been done under an identical dispersion parameter
estimate $\widehat{\varphi}^{\rm MLE}$. For the complete log-likelihood case, this is not really needed for
mean estimation because it cancels in the score equations for mean estimation.
However, a reasonable dispersion parameter estimate is crucial for the incomplete
case as it enters $\widehat{Y}^{(t)}$ in the E-step, see (6.47); thus, the caveat here is that we need
a reasonable dispersion estimate from the right-censored data (which we did not
discuss, here, and which requires further research). ∎

Fig. 6.17 Comparison of the estimated means $\widehat{\mu}_i$ in model Gamma GLM2 against (lhs) the crude
GLM with M = 50'000 and (rhs) the EM fitted right-censored model with M = 50'000; both axes are
on the log-scale, the dotted lines show the censoring point $\log(M)$


6.4.3 Parameter Estimation Under Lower-Truncation 


Compared to censoring we have less information under truncation because not only
the claim sizes below the lower-truncation point are unknown, but we also do not
know how many claims there are below that truncation point $\tau$. Assume we work
with responses belonging to the EDF. The incomplete log-likelihood is given by

$$\ell_{Y > \tau}(\theta) = \sum_{i=1}^{n} \log f(Y_i; \theta_i, v_i/\varphi) - \log\left(1 - F(\tau; \theta_i, v_i/\varphi)\right),$$


assuming that $Y = (Y_i)_{1 \le i \le n} > \tau$ collects all claims above the truncation point
$Y_i > \tau$, see (6.41). We proceed as in Fung et al. [147] to construct a complete
log-likelihood; there are different ways to do so, but this proposal is convenient
for parameter estimation. Firstly, we equip each observed claim $Y_i > \tau$ with an
independent count random variable $K_i \sim p(\cdot; \theta_i, v_i/\varphi)$ that determines the number
of claims below the truncation point that correspond to claim $i$ above the truncation
point. Secondly, we assume that these claims are given by independent observations
$Z_{i,1}, \ldots, Z_{i,K_i} \le \tau$, a.s., with a distribution obtained from an un-truncated version
of $Y_i$, i.e., we consider the upper-truncated version of $f(\cdot; \theta_i, v_i/\varphi)$ for $Z_{i,j}$. This
gives us the complete log-likelihood

$$\ell_{(Y, K, Z)}(\theta) = \sum_{i=1}^{n} \left[ \log\left(\frac{f(Y_i; \theta_i, v_i/\varphi)}{1 - F(\tau; \theta_i, v_i/\varphi)}\right) + \log p(K_i; \theta_i, v_i/\varphi) + \sum_{j=1}^{K_i} \log\left(\frac{f(Z_{i,j}; \theta_i, v_i/\varphi)}{F(\tau; \theta_i, v_i/\varphi)}\right) \right], \tag{6.49}$$

with $K = (K_i)_{1 \le i \le n}$, and $Z$ collects all (latent) claims $Z_{i,j} \le \tau$; an empty sum is
set equal to zero. Next, we assume that $K_i$ is following the geometric distribution

$$\mathbb{P}_\theta[K_i = k] = p(k; \theta_i, v_i/\varphi) = F(\tau; \theta_i, v_i/\varphi)^k \left(1 - F(\tau; \theta_i, v_i/\varphi)\right). \tag{6.50}$$


As emphasized in Fung et al. [147], this complete log-likelihood is an artificial
construct that supports parameter estimation of lower-truncated data. It does not
claim that the true un-truncated data follows this model (6.49), but it provides
a distributional extension below the truncation point $\tau > 0$ that is convenient
for parameter estimation. Namely, inserting this geometric distribution assumption
into (6.49) gives us complete log-likelihood

$$\ell_{(Y, K, Z)}(\theta) = \sum_{i=1}^{n} \left( \log f(Y_i; \theta_i, v_i/\varphi) + \sum_{j=1}^{K_i} \log f(Z_{i,j}; \theta_i, v_i/\varphi) \right). \tag{6.51}$$
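The step from (6.49) to (6.51) is a cancellation of all terms involving the distribution function. As an illustration, we write the three summands of (6.49) for a single index $i$ under the geometric assumption (6.50), using the shorthand $F_i = F(\tau; \theta_i, v_i/\varphi)$, which is our own abbreviation and not notation of the text:

```latex
\begin{align*}
\log\left(\frac{f(Y_i; \theta_i, v_i/\varphi)}{1 - F_i}\right)
  &= \log f(Y_i; \theta_i, v_i/\varphi) - \log(1 - F_i), \\
\log p(K_i; \theta_i, v_i/\varphi)
  &= K_i \log F_i + \log(1 - F_i), \\
\sum_{j=1}^{K_i} \log\left(\frac{f(Z_{i,j}; \theta_i, v_i/\varphi)}{F_i}\right)
  &= \sum_{j=1}^{K_i} \log f(Z_{i,j}; \theta_i, v_i/\varphi) - K_i \log F_i .
\end{align*}
```

Adding the three lines, the terms $\pm\log(1 - F_i)$ and $\pm K_i \log F_i$ cancel, and summing over $1 \le i \le n$ gives (6.51).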


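Within the EDF this allows us to do the same EM algorithm considerations as above;
note that this expression no longer involves the distribution function. We consider
one observation $Y_i > \tau$ and we drop the lower index $i$. This gives us complete
observation $(Y, K, Z = (Z_j)_{1 \le j \le K})$ and conditional density

$$f(k, z \,|\, y; \theta, v/\varphi) = \frac{f(y, k, z; \theta, v/\varphi)}{f_{(\tau,\infty)}(y; \theta, v/\varphi)} = \frac{f(y, k, z; \theta, v/\varphi)}{\exp\{\ell_{Y = y > \tau}(\theta)\}},$$

where $\ell_{Y > \tau}(\theta)$ is the log-likelihood of the lower-truncated datum $Y > \tau$. Choose an
arbitrary density $\pi$ modeling the random vector $(K, Z)$ below the truncation point
$\tau$. This gives us for the random vector $(K, Z) \sim \pi$

$$\begin{aligned}
\ell_{Y > \tau}(\theta) &= \int \pi(k, z)\, \ell_{Y > \tau}(\theta)\, d\nu(k, z) \\
&= \int \pi(k, z) \log\left(\frac{f(Y, k, z; \theta, v/\varphi)/\pi(k, z)}{f(k, z \,|\, Y; \theta, v/\varphi)/\pi(k, z)}\right) d\nu(k, z) \\
&= \int \pi(k, z) \log\left(\frac{f(Y, k, z; \theta, v/\varphi)}{\pi(k, z)}\right) d\nu(k, z) + D_{\rm KL}\left(\pi \,\|\, f(\cdot \,|\, Y; \theta, v/\varphi)\right) \\
&\ge \int \pi(k, z) \log\left(\frac{f(Y, k, z; \theta, v/\varphi)}{\pi(k, z)}\right) d\nu(k, z) \\
&= \mathbb{E}_\pi\left[\ell_{(Y, K, Z)}(\theta) \,\middle|\, Y\right] - \mathbb{E}_\pi\left[\log \pi(K, Z)\right] \\
&= \log f(Y; \theta, v/\varphi) + \mathbb{E}_\pi\left[\sum_{j=1}^{K} \log f(Z_j; \theta, v/\varphi)\right] - \mathbb{E}_\pi\left[\log \pi(K, Z)\right] \\
&\overset{\rm def.}{=} Q(\theta; \pi),
\end{aligned}$$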
where the second last identity uses that the log-likelihood (6.51) has a simple form
under the geometric distribution chosen for $K$; this is exactly the step where we
benefit from this specific choice of the probability extension below the truncation
point. There is a subtle point here. Namely, $\ell_{Y > \tau}(\theta)$ is the log-likelihood of the
lower-truncated datum $Y > \tau$, whereas $\log f(Y; \theta, v/\varphi)$ is the log-likelihood not
using any lower-truncation.

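The E-step for given canonical parameter estimate $\widehat{\theta}^{(t-1)}$ reads as

$$\begin{aligned}
\widehat{\pi}^{(t)} &= \underset{\pi}{\arg\max}\; Q\left(\widehat{\theta}^{(t-1)}; \pi\right) = \underset{\pi}{\arg\min}\; D_{\rm KL}\left(\pi \,\|\, f(\cdot \,|\, Y; \widehat{\theta}^{(t-1)}, v/\varphi)\right) \\
&= f\left(\cdot \,|\, Y; \widehat{\theta}^{(t-1)}, v/\varphi\right) = p\left(k; \widehat{\theta}^{(t-1)}, v/\varphi\right) \prod_{j=1}^{k} \frac{f(z_j; \widehat{\theta}^{(t-1)}, v/\varphi)}{F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)}.
\end{aligned}$$

The latter describes a compound distribution for $\sum_{j=1}^{K} Z_j$ with a geometric count
random variable $K$ and i.i.d. random variables $Z_1, Z_2, \ldots$, having
upper-truncated densities $f_{(-\infty,\tau]}(\cdot; \widehat{\theta}^{(t-1)}, v/\varphi)$. This allows us to calculate the
expected compound claim below the truncation point

$$\widehat{Y}^{(t)}_{\le \tau} = \mathbb{E}_{\widehat{\pi}^{(t)}}\left[\sum_{j=1}^{K} Z_j\right] = \mathbb{E}_{\widehat{\pi}^{(t)}}[K]\; \mathbb{E}_{\widehat{\pi}^{(t)}}[Z_1] = \frac{F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)}{1 - F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)} \int z\, f_{(-\infty,\tau]}(z; \widehat{\theta}^{(t-1)}, v/\varphi)\, d\nu(z).$$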
This completes the E-step. 
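The M-step considers within the EDF

$$\begin{aligned}
\widehat{\theta}^{(t)} &= \underset{\theta}{\arg\max}\; Q\left(\theta; \widehat{\pi}^{(t)}\right) \\
&= \underset{\theta}{\arg\max}\; \frac{\left(Y + \mathbb{E}_{\widehat{\pi}^{(t)}}\left[\sum_{j=1}^{K} Z_j\right]\right) \theta - \left(1 + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]\right) \kappa(\theta)}{\varphi/v} \\
&= \underset{\theta}{\arg\max}\; \frac{v\left(1 + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]\right)}{\varphi} \left( \frac{Y + \widehat{Y}^{(t)}_{\le \tau}}{1 + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]}\, \theta - \kappa(\theta) \right).
\end{aligned}$$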
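That is, the M-step applies the classical MLE step; we only need to change weights
and observations

$$v \;\mapsto\; \widehat{v}^{(t)} = v\left(1 + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]\right) = \frac{v}{1 - F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi)},$$

$$Y \;\mapsto\; \widehat{Y}^{(t)} = \frac{Y + \widehat{Y}^{(t)}_{\le \tau}}{1 + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]} = \frac{Y + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]\; \mathbb{E}_{\widehat{\pi}^{(t)}}[Z_1]}{1 + \mathbb{E}_{\widehat{\pi}^{(t)}}[K]}.$$

Note that this uses the specific structure of the EDF; in particular, we benefit from
linearity here, which allows for closed-form solutions.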


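EM algorithm for lower-truncated data within the EDF

(0) Choose an initial parameter $\widehat{\theta}^{(0)} = (\widehat{\theta}_i^{(0)})_{1 \le i \le n}$.
(1) Repeat for $t \ge 1$:

• E-step. Given parameter $\widehat{\theta}^{(t-1)} = (\widehat{\theta}_i^{(t-1)})_{1 \le i \le n}$, estimate the number of
claims $K_i$ and the corresponding claim sizes $Z_{i,j}$ by

$$\widehat{K}_i^{(t)} = \frac{F(\tau; \widehat{\theta}_i^{(t-1)}, v_i/\varphi)}{1 - F(\tau; \widehat{\theta}_i^{(t-1)}, v_i/\varphi)}, \qquad \widehat{Z}_{i,1}^{(t)} = \int z\, f_{(-\infty,\tau]}(z; \widehat{\theta}_i^{(t-1)}, v_i/\varphi)\, d\nu(z). \tag{6.52}$$

This provides us with estimated weights and observations for $1 \le i \le n$

$$\widehat{v}_i^{(t)} = v_i \left(1 + \widehat{K}_i^{(t)}\right) \qquad \text{and} \qquad \widehat{Y}_i^{(t)} = \frac{Y_i + \widehat{K}_i^{(t)} \widehat{Z}_{i,1}^{(t)}}{1 + \widehat{K}_i^{(t)}}.$$

• M-step. Calculate the MLE $\widehat{\theta}^{(t)} = (\widehat{\theta}_i^{(t)})_{1 \le i \le n}$ based on observations
$\widehat{Y}^{(t)} = (\widehat{Y}_i^{(t)})_{1 \le i \le n}$ and weights $\widehat{v}^{(t)} = (\widehat{v}_i^{(t)})_{1 \le i \le n}$, i.e., solve

$$\widehat{\theta}^{(t)} = \underset{\theta}{\arg\max}\; \ell_{\widehat{Y}^{(t)}}\left(\theta; \widehat{v}^{(t)}/\varphi\right) = \underset{\theta}{\arg\max}\; \sum_{i=1}^{n} \log f\left(\widehat{Y}_i^{(t)}; \theta_i, \widehat{v}_i^{(t)}/\varphi\right).$$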


Remarks 6.17 Essentially, the above algorithm uses that the MLE in the EDF is
based on a sufficient statistic of the observations, and in our case this sufficient
statistic is $\widehat{Y}^{(t)}$.


Example 6.18 (Lower-Truncated Claim Sizes) We revisit the gamma claim size
GLM introduced in Sect. 5.3.7, see also Example 6.16 on right-censored claims. We
choose as lower-truncation point τ = 1'000, i.e., we get rid of the very small claims
that mainly generate administrative expenses at a rather small claim compensation.
We have 70 claims below this truncation point, and there remain $n = 586$ claims
above the truncation point that can be used for model fitting in the lower-truncated
case. We use the EM algorithm for lower-truncated data to fit a GLM to this problem.

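In order to calculate the E-step we need to evaluate the conditional
expectation (6.52) under the gamma model. The truncation probability is given by

$$F(\tau; \widehat{\theta}^{(t-1)}, v/\varphi) = \int_0^\tau \frac{\beta^\alpha}{\Gamma(\alpha)}\, z^{\alpha-1} \exp\{-\beta z\}\, dz = G(\alpha, \beta\tau),$$

with shape parameter $\alpha = v/\varphi$ and scale parameter $\beta = -\widehat{\theta}^{(t-1)} v/\varphi$. In complete
analogy to (6.47) we have

$$\widehat{Z}_1^{(t)} = \int z\, f_{(-\infty,\tau]}(z; \widehat{\theta}^{(t-1)}, v/\varphi)\, d\nu(z) = \frac{\alpha}{\beta}\, \frac{G(\alpha + 1, \beta\tau)}{G(\alpha, \beta\tau)}.$$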

For the modeling we choose again the features as used for model Gamma GLM2,
this gives q + 1 = 7 regression parameter components and additionally we set for the
dispersion parameter $\hat{\varphi}^{\rm MLE} = 1.427$. This dispersion parameter we keep fixed in all
the models studied in this example. In a first step we simply fit a gamma GLM to the
lower-truncated data $Y_i > \tau$. We call this model ‘crude GLM2’, and it overestimates
the true claim sizes because it ignores the fact of having lower-truncated data.

To initialize the EM algorithm for lower-truncated data we use the model crude 
GLM2. We then iterate the algorithm for 10 steps which provides convergence. 
The results are presented in Table 6.5. We observe that the resulting log-likelihood 
fitted on the lower-truncated data and evaluated on the complete data $\ell_Y$ (which is
available here) is the same as for model Gamma GLM2 which has been estimated 
on the complete data. Moreover, this lower-truncated EM algorithm fitted model 
slightly under-estimates the average claim sizes. 

Figure 6.18 shows the estimated means $\hat{\mu}_i$ on an individual claims level. The
x-axis always gives the estimates from the complete log-likelihood model Gamma
GLM2. The y-axis on the left-hand side shows the estimates from the crude GLM
and the right-hand side the estimates from the EM algorithm fitted counterpart
(fitted on the lower-truncated data). We observe that the crude model overestimates
the claims (being above the orange diagonal), in particular, this applies to claims
with lower expected claim amounts. The EM algorithm fitted model, considering
the fact that we have lower-truncated data, corrects for the truncation, and the
resulting estimates almost completely coincide with the ones from the complete
log-likelihood model. Again we remark that we use an identical dispersion parameter
estimate $\hat{\varphi}^{\rm MLE}$, and it is an open problem to select a reasonable value from
lower-truncated data. ■


Table 6.5 Comparison of the complete log-likelihood and the incomplete log-likelihood (lower-truncation τ = 1’000) results

                                | # Param. | Log-likelihood $\ell_Y(\hat{\beta}^{\rm MLE}, \hat{\varphi}^{\rm MLE})$ | Dispersion est. $\hat{\varphi}^{\rm MLE}$ | Average amount | Rel. change
Gamma GLM2 (complete data)      | 7 + 1    | −7’129 | 1.427 | 25’130 |
Crude GLM2 (lower-truncated)    | 7 + 1    | −7’133 |       | 26’879 | +7%
EM est. GLM2 (lower-truncated)  | 7 + 1    | −7’129 |       | 24’900 | −1%


Fig. 6.18 Comparison of the estimated means $\hat{\mu}_i$ in model Gamma GLM2 against (lhs) the crude
GLM and (rhs) the EM fitted lower-truncated model; both axes are on the log-scale


Example 6.19 (Zero-Truncated Claim Counts and the Hurdle Poisson Model) In 
Sect. 5.3.6, we have been studying the ZIP model that has assigned an additional 
probability weight to the event {N = 0} of having zero claims. This model can 
be understood as a hierarchical model with a latent variable Z indicating whether 
we have an excess zero claim or not, see (5.41). In that situation we have a 
mixture distribution of a Poisson distribution and a degenerate distribution. Fitting 
in Example 5.25 has been done brute force by using a general purpose optimizer, 
but we could also use the EM algorithm for mixture distributions. 

An alternative way of modeling excess zeros is the hurdle approach which 
combines a lower-truncated count distribution with a point mass in zero. For the 
Poisson case this reads as, see (5.42), 


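$$f_{\text{hurdle Poisson}}(k; \lambda, v, \pi_0) = \begin{cases} \pi_0 & \text{for } k = 0, \\ (1 - \pi_0) \, \dfrac{e^{-v\lambda} (v\lambda)^k / k!}{1 - e^{-v\lambda}} & \text{for } k \in \mathbb{N}, \end{cases} \qquad (6.53)$$


for $\pi_0 \in (0,1)$ and $\lambda, v > 0$. If we ignore any observation {N = 0} we obtain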
a lower-truncated Poisson model, also called zero-truncated Poisson (ZTP) model. 
This ZTP model can be fitted with the EM algorithm for lower-truncated data. In the 
following we only consider insurance policies i with $N_i > 0$. The log-likelihood of
the ZTP model N > 0 is given by (we consider one single component only and drop
the lower index in the notation)


$$\theta \mapsto \ell_{N>0}(\theta) = N\theta - v e^{\theta} - \log(N!) + N \log(v) - \log\big(1 - e^{-v e^{\theta}}\big), \qquad (6.54)$$

with exposure v > 0 and canonical parameter $\theta \in \Theta = \mathbb{R}$ such that $\lambda = \exp\{\theta\}$.


The ZTP model provides for the random variable K the following geometric 
distribution (for the number of claims below the truncation point), see (6.50), 


$$P_{\theta}[K = k] = P_{\theta}[N = 0]^k \, P_{\theta}[N > 0] = e^{-k v e^{\theta}} \big(1 - e^{-v e^{\theta}}\big).$$


In view of (6.51), this gives us complete log-likelihood (note that $Z_j = 0$ for all j)


$$\ell_{(N,K,Z)}(\theta) = N\theta - v e^{\theta} - \log(N!) + N \log(v) + \sum_{j=1}^{K} \big( Z_j \theta - v e^{\theta} - \log(Z_j!) + Z_j \log(v) \big)$$

$$= N\theta - (1 + K) \, v e^{\theta} - \log(N!) + N \log(v).$$


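We can now directly apply a simplified version of the EM algorithm for
lower-truncated data. For the E-step we have, given parameter $\hat{\theta}^{(t-1)}$,


$$\hat{K}^{(t)} = \frac{P_{\hat{\theta}^{(t-1)}}[N = 0]}{1 - P_{\hat{\theta}^{(t-1)}}[N = 0]} = \frac{e^{-v e^{\hat{\theta}^{(t-1)}}}}{1 - e^{-v e^{\hat{\theta}^{(t-1)}}}} \quad \text{and} \quad \hat{Z}_1^{(t)} = 0.$$


This provides us with the estimated weights and observations (set Y = N/v)


$$\hat{v}^{(t)} = v \big(1 + \hat{K}^{(t)}\big) = \frac{v}{1 - e^{-v e^{\hat{\theta}^{(t-1)}}}} \quad \text{and} \quad \hat{y}^{(t)} = \frac{Y}{1 + \hat{K}^{(t)}} = \frac{N}{\hat{v}^{(t)}}. \qquad (6.55)$$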


Thus, the EM algorithm iterates Poisson MLEs, and the E-step modifies the weights
$\hat{v}^{(t)}$ in each step of the loop correspondingly. We remark that the ZTP model
has an EF representation which allows one to directly estimate the corresponding
parameters without using the EM algorithm, see Remark 6.20, below.
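
The iteration (6.55) can be sketched for the simplest case of a homogeneous ZTP model without covariates, where the M-step Poisson MLE is available in closed form (total claims divided by total adjusted exposure). This is an illustrative simplification, not the GLM fit used in this example; the function names `ztp_em_null` and `ztp_loglik` are hypothetical. In the regression case the M-step would instead be a Poisson GLM fit with adjusted offsets.

```python
import math


def ztp_loglik(lam, counts, exposures):
    """Zero-truncated Poisson log-likelihood (6.54), summed over all policies."""
    total = 0.0
    for n, v in zip(counts, exposures):
        total += (n * math.log(lam) - v * lam - math.lgamma(n + 1)
                  + n * math.log(v) - math.log1p(-math.exp(-v * lam)))
    return total


def ztp_em_null(counts, exposures, lam0=1.0, n_iter=500):
    """EM iteration for a homogeneous ZTP model (all counts must be >= 1)."""
    lam = lam0
    for _ in range(n_iter):
        # E-step (6.55): inflate each exposure by the expected number of
        # unobserved zero-claim policies, v_hat = v / (1 - exp(-v * lam)).
        v_hat = [v / (1.0 - math.exp(-v * lam)) for v in exposures]
        # M-step: Poisson MLE under the adjusted exposures.
        lam = sum(counts) / sum(v_hat)
    return lam


if __name__ == "__main__":
    # Illustrative toy data (not from the French MTPL portfolio).
    counts = [1, 1, 2, 1, 3, 1]
    exposures = [0.5, 1.0, 1.0, 0.8, 1.0, 0.3]
    lam_hat = ztp_em_null(counts, exposures)
    print(lam_hat, ztp_loglik(lam_hat, counts, exposures))
```

At convergence the estimate satisfies the ZTP score equation, i.e. the sum of $v_i \lambda / (1 - e^{-v_i \lambda})$ over all policies equals the total number of observed claims.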

We revisit the French MTPL claim frequency data, and, in particular, we use 
model Poisson GLM3 as a benchmark, we refer to Tables 5.5 and 5.10. The feature 
engineering is done exactly as in model Poisson GLM3. We then select only the 
insurance policies from the learning data $\mathcal{L}$ that have suffered at least one claim, i.e.,
$N_i > 0$. These are m = 22’434 out of n = 610’206 insurance policies. Thus, we
only consider m/n = 3.68% of all insurance policies, and we fit the lower-truncated
log-likelihood (ZTP model) to this data


$$\ell_{N>0}(\beta) = \sum_{i=1}^{m} N_i \theta_i - v_i e^{\theta_i} - \log(N_i!) + N_i \log(v_i) - \log\big(1 - e^{-v_i e^{\theta_i}}\big),$$


where $1 \le i \le m$ runs over all insurance policies with at least one claim and where
the canonical parameter $\theta_i$ is given by the linear predictor $\theta_i = \langle \beta, x_i \rangle$. We fit this
model using the EM algorithm for lower-truncated data. In each loop this requires
that the offset $o_i^{(t)} = \log(\hat{v}_i^{(t)})$ is adjusted according to (6.55); for the discussion of
offsets we refer to Sect. 5.2.3. Convergence of the EM algorithm is achieved after
roughly 75 iterations, see Fig. 6.19 (lhs).


Fig. 6.19 (lhs) Convergence of the EM algorithm for the lower-truncated data in the Poisson
hurdle case; (rhs) canonical parameters of the Poisson GLMs fitted on all data $\mathcal{L}$ vs. fitted only
on policies with $N_i > 0$


Table 6.6 Run times, number of parameters, AICs, in-sample and out-of-sample deviance losses
(units are in $10^{-2}$) and in-sample average frequency of the Poisson null model and the Poisson,
negative-binomial, ZIP and hurdle Poisson GLMs

                                                    | Run time | # Param. | AIC     | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq.
Poisson null                                        | –        | 1        | 199’506 | 25.213 | 25.445 | 7.36%
Poisson GLM3                                        | 15s      | 50       | 192’716 | 24.084 | 24.102 | 7.36%
NB GLM3 with $\hat{\alpha}_{\rm NB}^{\rm MLE} = 1.810$ | 85s      | 51       | 192’113 | 20.72  | 20.674 | 7.38%
ZIP GLM3 (null $\pi_0$)                             | 270s     | 51       | 192’393 | –      | –      | 7.37%
Hurdle Poisson GLM3                                 | 300s     | 100      | 191’851 | –      | –      | 7.39%

In our first analysis we do not consider the Poisson hurdle model, but we simply
consider model Poisson GLM3. However, this Poisson model with regression
parameter $\hat{\beta}$ is fitted only on the data $N_i > 0$ (exactly using the results of the
EM algorithm for lower-truncated data $N_i > 0$). The resulting predictive model is
presented in Table 6.7. We observe that model Poisson GLM3 that is only fitted on
the data $N_i > 0$ is clearly not competitive, i.e., we cannot simply extrapolate this
estimated model to $\{N_i = 0\}$. This extrapolation results in a Poisson GLM that has
a much too large average frequency of 15.11%, see last column of Table 6.7; this
bias can clearly be seen in Fig. 6.19 (rhs) where we compare the two fits. From
this we conclude that either the Poisson model assumption in general does not
match the data, or that we have excess zeros (which do not influence the estimation
procedure if we only consider the policies with at least one claim). Let us compare
the lower-truncated log-likelihood $\ell_{N>0}$ out-of-sample only on the policies with at
least one claim (ZTP model). We observe that the EM fitted model provides a better
description of the data, as we have a bigger log-likelihood than the model fitted on
all data $\mathcal{L}$ (i.e. −0.2195 vs. −0.2278 for the ZTP log-likelihood). Thus, the
lower-truncated fitting procedure finds a better model on $\{N_i > 0\}$ when only fitted on
these lower-truncated claim counts.


Table 6.7 Number of parameters, in-sample and out-of-sample deviance losses on all data
(units are in $10^{-2}$), out-of-sample lower-truncated log-likelihood $\ell_{N>0}$ and in-sample average
frequency of the Poisson null model and model Poisson GLM3 fitted on all data $\mathcal{L}$ and fitted on
the data $N_i > 0$ only

                                   | # Param. | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Out-of-sample $\ell_{N>0}$ | Aver. freq.
Poisson null                       | 1        | 25.213 | 25.445 | –       | 7.36%
Poisson GLM3 fitted on all data    | 50       | 24.084 | 24.102 | −0.2278 | 7.36%
Poisson GLM3 fitted on $N_i > 0$   | 50       | 28.064 | 28.211 | −0.2195 | 15.11%

This analysis concludes that we need to fit the full hurdle Poisson model (6.53).
That is, we cannot simply extrapolate the model fitted on the ZTP log-likelihood
$\ell_{N>0}$ because, typically, $\pi_0(x_i) \neq \exp\{-v_i e^{\langle \hat{\beta}, x_i \rangle}\}$, the latter coming from the
Poisson GLM with regression parameter $\hat{\beta}$. We model the zero claim probability
$\pi_0(x_i)$ by the logistic Bernoulli GLM indicating whether we have claims or not.
We set up the logistic GLM for $p(x_i) = 1 - \pi_0(x_i)$ describing the indicator
$Y_i = 1_{\{N_i > 0\}}$ of having claims. The difficulty compared to the Poisson model is that
we cannot easily integrate the time exposure $v_i$ as a pro rata temporis variable like
in the Poisson case. We therefore make the following considerations. The canonical
link in the logistic Bernoulli GLM is the logit function $p \mapsto {\rm logit}(p) = \log(p/(1-p)) = \log(p) - \log(1-p)$
for $p \in (0,1)$. Typically, in our application, $p \ll 1$ is
fairly small because claims are rare events. This implies $\log(p/(1-p)) \approx \log(p)$,
i.e., the logit link behaves similarly to the log-link for small default probabilities p.
This motivates integrating the logged exposures $\log v_i$ as offsets into the logistic
probabilities. That is, we make the following model assumption


$$(x_i, v_i) \mapsto {\rm logit}(p(x_i, v_i)) = \log(v_i) + \langle \tilde{\beta}, x_i \rangle,$$


with offset $o_i = \log(v_i)$ and regression parameter $\tilde{\beta} \in \mathbb{R}^{q+1}$. We fit this model using
the R command glm using family=binomial(). The results then allow us to
define the estimated hurdle Poisson model by, recall $p(x_i, v_i) = 1 - \pi_0(x_i, v_i)$,


$$\hat{f}_{\text{hurdle Poisson}}(k; x_i, v_i) = \begin{cases} 1 - p(x_i, v_i) = \big(1 + \exp\{\log(v_i) + \langle \tilde{\beta}, x_i \rangle\}\big)^{-1} & \text{for } k = 0, \\ p(x_i, v_i) \, \dfrac{e^{-\mu(x_i, v_i)} \mu(x_i, v_i)^k / k!}{1 - e^{-\mu(x_i, v_i)}} & \text{for } k \in \mathbb{N}, \end{cases}$$


where $\tilde{\beta} \in \mathbb{R}^{q+1}$ is the regression parameter from the logistic Bernoulli GLM,
and where $\mu(x_i, v_i) = v_i \exp\langle \hat{\beta}, x_i \rangle$ is the Poisson GLM estimated with the
EM algorithm on the lower-truncated data $N_i > 0$ (ZTP model). The results are
presented in Table 6.6.


Table 6.8 Contingency table of the observed numbers of policies against predicted numbers of
policies with given claim counts ClaimNb (in-sample)

Numbers of claims ClaimNb                    | 0       | 1      | 2     | 3   | 4  | 5
Observed number of policies                  | 587’772 | 21’198 | 1’174 | 57  | 4  | 1
Poisson predicted number of policies         | 587’325 | 22’064 | 779   | 34  | 3  | 0.3
NB predicted number of policies              | 587’902 | 20’982 | 1’200 | 100 | 15 | 4
ZIP predicted number of policies             | 587’829 | 21’094 | 1’191 | 79  | 9  | 4
Hurdle Poisson predicted number of policies  | 587’772 | 21’119 | 1’233 | 76  | 6  | 1
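
To make the structure of the estimated hurdle Poisson model concrete, the following sketch evaluates its probability weights for a single policy. It assumes the two linear predictors (without offsets) have already been computed from a logistic Bernoulli GLM and from a ZTP Poisson GLM; the function name and the argument names `eta_logit` and `eta_pois` are hypothetical, and the numerical values are purely illustrative.

```python
import math


def hurdle_poisson_pmf(k, v, eta_logit, eta_pois):
    """Probability of k claims under the hurdle Poisson model for one policy.

    v         : time exposure of the policy
    eta_logit : linear predictor of the logistic GLM (without offset log v)
    eta_pois  : linear predictor of the ZTP Poisson GLM (without offset log v)
    """
    # Logistic part with offset log(v): probability of at least one claim.
    p = 1.0 / (1.0 + math.exp(-(math.log(v) + eta_logit)))
    if k == 0:
        return 1.0 - p
    # Zero-truncated Poisson part with mean parameter mu = v * exp(eta_pois).
    mu = v * math.exp(eta_pois)
    log_ztp = (-mu + k * math.log(mu) - math.lgamma(k + 1)
               - math.log1p(-math.exp(-mu)))
    return p * math.exp(log_ztp)


if __name__ == "__main__":
    v, eta_logit, eta_pois = 0.5, -2.5, -1.0   # illustrative values only
    probs = [hurdle_poisson_pmf(k, v, eta_logit, eta_pois) for k in range(6)]
    print(probs)
    # For small p the logit link is close to the log-link:
    p = 1.0 - probs[0]
    print(math.log(p / (1.0 - p)), math.log(p))
```

Because the zero probability and the positive claim counts are driven by separate predictors, the two GLMs can be fitted independently, which is the separation of effects referred to in this example.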

Table 6.6 compares the hurdle Poisson model to the approaches studied in 
Table 5.10. Firstly, fitting the hurdle Poisson model is more time intensive, the EM 
algorithm takes some time and we need to fit the Bernoulli logistic GLM which 
is of a similar complexity as fitting model Poisson GLM3. The results in terms of 
AIC look convincing. The hurdle Poisson model provides an excellent model for the 
indicator of having a claim (here it outperforms model ZIP GLM3). It also tries to 
optimally fit a ZTP model to all insurance policies having at least one claim. This 
can also be seen from Table 6.8 which determines the expected number of policies 
that suffer the different numbers of claims. 

We close this example by concluding that the hurdle Poisson model provides the 
best description, at the price of using more parameters. The ZIP model could be 
lifted to a similar level, however, we consider fitting the hurdle approach to be more 
convenient, see also Remark 6.20, below. In particular, feature engineering seems 
simpler in the hurdle approach because the different effects are clearly separated, 
whereas in the ZIP approach it is more difficult to suitably model the excess zeros, 
see also Listing 5.10. This closes this example. ■


Remark 6.20 In (6.54) we have been considering the ZTP model for different
exposures v > 0. If we set these exposures to v = 1, we obtain the ZTP
log-likelihood

$$\ell_{N>0}(\theta) = N\theta - \big(e^{\theta} + \log(1 - e^{-e^{\theta}})\big) - \log(N!).$$


Note that this describes a single-parameter linear EF with cumulant function


$$\kappa(\theta) = e^{\theta} + \log\big(1 - e^{-e^{\theta}}\big),$$

for canonical parameter in the effective domain $\theta \in \Theta = \mathbb{R}$. The mean of this EF
model is given by


$$\mu = E_{\theta}[N] = \kappa'(\theta) = \frac{e^{\theta}}{1 - e^{-e^{\theta}}} = \frac{\lambda}{1 - e^{-\lambda}},$$


where we set $\lambda = e^{\theta}$. The variance is given by


$${\rm Var}_{\theta}(N) = \kappa''(\theta) = \mu \left( \frac{e^{\lambda} - (1 + \lambda)}{e^{\lambda} - 1} \right) = \mu \big(1 - \mu e^{-\lambda}\big) > 0.$$


Note that the term in brackets is positive but less than one. The latter implies that 
the ZTP model has under-dispersion. Alternatively to the EM algorithm, we can 
also directly fit a GLM to this ZTP model. The only difficulty is that we need to 
appropriately integrate the time exposures. The original Poisson model suggests 
that if we choose the canonical parameter being equal to the linear predictor, we 
should integrate the logged exposures as offsets into the linear predictors. Along 
these lines, if we choose the canonical link $h = (\kappa')^{-1}$ of the ZTP model, we
receive that the canonical parameter θ is equal to the linear predictor ⟨β, x⟩, and we
can directly integrate the logged exposures as offsets into the canonical parameters, 
see (5.25). This then allows us to directly fit this ZTP model with exposures using 
Fisher’s scoring method. In this case of a concave log-likelihood function, the result 
will be identical to the solution of the EM algorithm found in Example 6.19, and, in 
fact, this direct approach is more straightforward and more time-efficient. Similar 
considerations can be done for other hurdle models. 
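
The mean and variance formulas of the ZTP model with unit exposure can be checked numerically by summing over the probability weights. The sketch below is only a sanity check under that assumption; both function names are hypothetical.

```python
import math


def ztp_moments(lam):
    """Closed-form mean and variance of the zero-truncated Poisson model."""
    mu = lam / (1.0 - math.exp(-lam))
    var = mu * (1.0 - mu * math.exp(-lam))
    return mu, var


def ztp_moments_by_summation(lam, k_max=200):
    """Mean and variance obtained by summing over the ZTP probability weights."""
    norm = 1.0 - math.exp(-lam)
    m1 = 0.0
    m2 = 0.0
    for k in range(1, k_max + 1):
        pk = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) / norm
        m1 += k * pk
        m2 += k * k * pk
    return m1, m2 - m1 * m1


if __name__ == "__main__":
    for lam in (0.1, 1.0, 3.0):
        print(lam, ztp_moments(lam), ztp_moments_by_summation(lam))
```

In each case the variance is smaller than the mean, which is the under-dispersion noted in the remark.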


6.4.4 Composite Models 


In Sect. 6.3.1 we have promoted mixing distributions in cases where the data cannot
be modeled by a single EDF distribution. Alternatively, one can also
compose densities, which leads to so-called composite models (also called splicing
models). This idea has been introduced to the actuarial literature by Cooray–Ananda
[81] and Scollnik [332]. Assume we have two absolutely continuous densities
$f^{(i)}(\cdot; \theta_i)$ with corresponding distribution functions $F^{(i)}(\cdot; \theta_i)$, i = 1, 2. These two
densities can easily be composed at a splicing value τ and with weight $p \in (0, 1)$
by considering the following composite density


$$f(y; p, \theta_1, \theta_2) = p \, \frac{f^{(1)}(y; \theta_1) \, 1_{\{y \le \tau\}}}{F^{(1)}(\tau; \theta_1)} + (1 - p) \, \frac{f^{(2)}(y; \theta_2) \, 1_{\{y > \tau\}}}{1 - F^{(2)}(\tau; \theta_2)}, \qquad (6.56)$$


provided that both denominators are non-zero. In this notation we treat splicing
value τ as a hyper-parameter that is chosen by the modeler, and is not estimated
from data. In view of (6.41) we can rewrite this in terms of lower- and
upper-truncated densities


$$f(y; p, \theta_1, \theta_2) = p \, f^{(1)}_{(-\infty,\tau]}(y; \theta_1) + (1 - p) \, f^{(2)}_{(\tau,\infty)}(y; \theta_2).$$


In this notation, we see that a composite model can also be interpreted as a mixture
model with mixture probability $p \in (0, 1)$ and mixing densities $f^{(1)}_{(-\infty,\tau]}$ and $f^{(2)}_{(\tau,\infty)}$
having disjoint supports $(-\infty, \tau]$ and $(\tau, \infty)$, respectively.

These disjoint supports allow for simpler MLE, i.e., we do not need to rely on
the ‘EM algorithm for mixture distributions’ to fit this model. The log-likelihood of
$Y \sim f(y; p, \theta_1, \theta_2)$ is given by


$$\ell_Y(p, \theta_1, \theta_2) = \big(\log(p) + \log f^{(1)}_{(-\infty,\tau]}(Y; \theta_1)\big) \, 1_{\{Y \le \tau\}} + \big(\log(1 - p) + \log f^{(2)}_{(\tau,\infty)}(Y; \theta_2)\big) \, 1_{\{Y > \tau\}}.$$


This shows that the log-likelihood nicely decouples in the composite case and all
parameters can directly be estimated with MLE: parameter $\theta_1$ uses all observations
smaller or equal to τ, parameter $\theta_2$ uses all observations bigger than τ, and p is
estimated by the proportions of claims below and above the splicing point τ. This
holds for a null model as well as for a GLM approach for $\theta_1$, $\theta_2$ and p.
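
The decoupling can be illustrated with a toy composite model. The sketch assumes, purely for illustration, an exponential body on (0, τ] and an exponential tail on (τ, ∞); the distribution choice, the function names and the data are hypothetical. For this tail the lower-truncated MLE has a closed form because the exponential distribution is memoryless.

```python
import math


def composite_logpdf(y, p, rate1, rate2, tau):
    """Log of the composite density (6.56) for an exponential body and tail."""
    if y <= tau:
        # Upper-truncated exponential body: f1(y) / F1(tau).
        return (math.log(p) + math.log(rate1) - rate1 * y
                - math.log(-math.expm1(-rate1 * tau)))
    # Lower-truncated exponential tail: f2(y) / (1 - F2(tau)).
    return math.log(1.0 - p) + math.log(rate2) - rate2 * (y - tau)


def fit_p_and_tail(ys, tau):
    """MLE of the mixture probability p and of the tail rate (null model)."""
    below = [y for y in ys if y <= tau]
    above = [y for y in ys if y > tau]
    p_hat = len(below) / len(ys)                      # share of claims below tau
    rate2_hat = len(above) / sum(y - tau for y in above)
    return p_hat, rate2_hat


if __name__ == "__main__":
    ys = [0.2, 0.5, 0.9, 1.5, 2.5, 4.0]               # illustrative claim sizes
    print(fit_p_and_tail(ys, tau=1.0))
```

The body parameter would be estimated separately from the observations below τ; this requires a numerical solver for the upper-truncated exponential and is omitted here.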

Nevertheless, the EM algorithm may still be used for parameter estimation, 
namely, truncation may ask for the ‘EM algorithm for truncated data’. Alternatively, 
we could also use the ‘EM algorithm for censored data’ to estimate the truncated 
densities, because we have knowledge of the number of claims above and below the 
splicing point τ, thus, we could right- or left-censor these claims. The latter may
lead to more stability in the estimation procedure since we use more information 
in parameter estimation, i.e., the two truncated densities will not be independent 
because they simultaneously consider all claim counts (but not identical claim sizes 
due to censoring). 

For composite models one sometimes requires more regularity in the densities.
We may, e.g., require continuity of the density at the splicing point, which provides
mixture probability


$$p = \frac{f^{(2)}(\tau; \theta_2) \, F^{(1)}(\tau; \theta_1)}{f^{(1)}(\tau; \theta_1) \big(1 - F^{(2)}(\tau; \theta_2)\big) + f^{(2)}(\tau; \theta_2) \, F^{(1)}(\tau; \theta_1)}.$$


This reduces the number of parameters to be estimated but complicates the score
equations. If we require a differential condition in τ we receive requirement


$$p = \frac{f_y^{(2)}(\tau; \theta_2) \, F^{(1)}(\tau; \theta_1)}{f_y^{(1)}(\tau; \theta_1) \big(1 - F^{(2)}(\tau; \theta_2)\big) + f_y^{(2)}(\tau; \theta_2) \, F^{(1)}(\tau; \theta_1)},$$

where $f_y^{(i)}(y; \theta_i)$ denotes the first derivative w.r.t. y. Together with the continuity
this provides requirement for having differentiability in τ


$$\frac{f^{(2)}(\tau; \theta_2)}{f_y^{(2)}(\tau; \theta_2)} = \frac{f^{(1)}(\tau; \theta_1)}{f_y^{(1)}(\tau; \theta_1)}.$$


Again this reduces the degrees of freedom in parameter estimation but complicates
the score equations. We refrain from giving an example and close this section; we
will consider a deep composite regression model in Sect. 11.3.2, below, where we
replace the fixed splicing point by a quantile for a fixed quantile level.
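
As a small numerical illustration of the continuity requirement, the sketch below computes the mixture probability that makes a composite density continuous at the splicing point and verifies that the left and right limits agree. The exponential body and tail, the helper names and the parameter values are assumptions chosen only for illustration.

```python
import math


def exp_pdf(y, rate):
    """Density of the exponential distribution."""
    return rate * math.exp(-rate * y)


def exp_cdf(y, rate):
    """Distribution function of the exponential distribution."""
    return -math.expm1(-rate * y)


def continuity_weight(f1_tau, F1_tau, f2_tau, F2_tau):
    """Mixture probability p that makes the composite density continuous at tau."""
    num = f2_tau * F1_tau
    return num / (f1_tau * (1.0 - F2_tau) + num)


if __name__ == "__main__":
    tau, rate1, rate2 = 1.0, 2.0, 0.5                 # illustrative values only
    p = continuity_weight(exp_pdf(tau, rate1), exp_cdf(tau, rate1),
                          exp_pdf(tau, rate2), exp_cdf(tau, rate2))
    left = p * exp_pdf(tau, rate1) / exp_cdf(tau, rate1)
    right = (1.0 - p) * exp_pdf(tau, rate2) / (1.0 - exp_cdf(tau, rate2))
    print(p, left, right)
```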


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons licence and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons licence, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons licence and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 7
Deep Learning


In the sequel, we introduce deep learning models. In this chapter these deep 
learning models will be based on fully-connected feed-forward neural networks. We 
present these networks as an extension of GLMs. These networks perform feature 
engineering themselves. We discuss how networks achieve this, and we explain how 
networks are used for predictive modeling. There is a vastly growing literature on 
deep learning with networks, the classical reference is the book of Goodfellow et 
al. [166], but also the numerous tutorials around the open-source deep learning 
libraries TensorFlow [2], Keras [77] or PyTorch [296] give an excellent overview 
of the state-of-the-art in this field. 


7.1 Deep Learning and Representation Learning 


In Chap. 5 on GLMs, we have been modeling the mean structure of the responses 
Y, given features x, by the following regression function, see (5.6), 


$$x \mapsto \mu(x) = E_{\theta(x)}[Y] = g^{-1}\langle \beta, x \rangle. \qquad (7.1)$$


The crucial assumption has been that the regression function (7.1) provides a
reasonable functional description of the expected value $E_{\theta(x)}[Y]$ of datum (Y, x).
As described in Sect. 5.2.2, this typically requires manual feature engineering of x, 
bringing feature information into the right structural form. 

In contrast to manual feature engineering, deep learning aims at performing an
automated feature engineering within the statistical model by massaging information
through different transformations. Deep learning uses a finite sequence of
functions $(z^{(m)})_{1 \le m \le d}$, called layers,


$$z^{(m)} : \{1\} \times \mathbb{R}^{q_{m-1}} \to \{1\} \times \mathbb{R}^{q_m},$$

of (fixed) dimensions $q_m \in \mathbb{N}$, $1 \le m \le d$, and initialization $q_0 = q$ being the
dimension of the (raw) feature information $x \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q}$. Each of these
layers presents a new representation of the features, that is, after layer m we have a
$q_m$-dimensional representation of the raw feature $x \in \mathcal{X}$


$$z^{(m:1)}(x) = \big(z^{(m)} \circ \cdots \circ z^{(1)}\big)(x) \in \{1\} \times \mathbb{R}^{q_m}. \qquad (7.2)$$


Note that the first component is always identically equal to 1. For this reason we
call the representation $z^{(m:1)}(x) \in \{1\} \times \mathbb{R}^{q_m}$ of x to be $q_m$-dimensional.

Deep learning now assumes that we have $d \in \mathbb{N}$ appropriate transformations
(layers) $z^{(m)}$, $1 \le m \le d$, such that $z^{(d:1)}(x)$ provides a suitable $q_d$-dimensional
representation of the raw feature $x \in \mathcal{X}$, that then enters a GLM


$$\mu(x) = E_{\theta(x)}[Y] = g^{-1}\langle \beta, z^{(d:1)}(x) \rangle, \qquad (7.3)$$


with link function $g : \mathcal{M} \to \mathbb{R}$ and regression parameter $\beta \in \mathbb{R}^{q_d+1}$. This
regression architecture is called a feed-forward network of depth $d \in \mathbb{N}$ because
information x is processed in a directed acyclic (feed-forward) path through the d
layers $z^{(1)}, \ldots, z^{(d)}$ before entering the final GLM.

Each layer $z^{(m)}$ involves parameters. Successful deep learning simultaneously
fits these parameters as well as the regression parameter β to the available learning
data $\mathcal{L}$ so that we obtain an optimal predictive model on the test data $\mathcal{T}$. That is,
the learned model should optimally generalize to unseen data, we refer to Chap. 4 
on predictive modeling. Thus, the process of optimal representation learning is also 
part of the model fitting procedure. In contrast to GLMs, the resulting log-likelihood 
functions are non-concave in their parameters because, typically, each layer involves 
non-linear transformations. This makes model fitting a challenge. State-of-the-art 
model fitting in deep learning uses variants of the gradient descent algorithm which 
we have already met in Sect. 6.2.4. 


Remark 7.1 Representation learning $x \mapsto z^{(d:1)}(x)$ is closely related to Mercer’s
kernel [272]. If we have a portfolio with features $x_1, \ldots, x_n$, we obtain a Mercer’s
kernel by considering the matrix


$$K = \big(K(x_i, x_j)\big)_{1 \le i,j \le n} = \big(\langle z^{(d:1)}(x_i), z^{(d:1)}(x_j) \rangle\big)_{1 \le i,j \le n} \in \mathbb{R}^{n \times n}. \qquad (7.4)$$


In many regression problems it can be shown that one can equivalently work
with the design matrix $\mathfrak{Z} = (z^{(d:1)}(x_1), \ldots, z^{(d:1)}(x_n))^\top \in \mathbb{R}^{n \times (q_d+1)}$ or with
Mercer’s kernel $K \in \mathbb{R}^{n \times n}$. Mercer’s kernel does not require the full knowledge
of the learned representations $z^{(d:1)}(x_i)$, but it suffices to know the discrepancies
between $z^{(d:1)}(x_i)$ and $z^{(d:1)}(x_j)$ measured by the scalar products $K(x_i, x_j)$. This
is also closely related to the cosine similarity in word embeddings, see (10.11). This 
approach then results in replacing the search for an optimal representation learning 
by a search of the optimal Mercer’s kernel for the given data; this is called the kernel 
trick in machine learning. 
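
As a minimal illustration of (7.4), the following sketch builds the kernel matrix from a list of learned representations by taking all pairwise scalar products. The function name `gram_matrix` and the toy representations are hypothetical; each row is assumed to carry the leading intercept component 1.

```python
def gram_matrix(representations):
    """Kernel matrix of pairwise scalar products between representation vectors."""
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in representations]
            for zi in representations]


if __name__ == "__main__":
    # Three toy representations, each with the leading intercept component 1.
    Z = [[1.0, 0.5, -0.2],
         [1.0, 0.1, 0.3],
         [1.0, -0.4, 0.0]]
    for row in gram_matrix(Z):
        print(row)
```

The resulting matrix is symmetric of size n × n regardless of the dimension of the representations, which is why working with the kernel can replace working with the representations themselves.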


7.2 Generic Feed-Forward Neural Networks 


Feed-forward neural (FN) networks use special layers $z^{(m)}$ in (7.2)-(7.3), whose
components are called neurons. This is discussed and studied in detail in this section. 


7.2.1 Construction of Feed-Forward Neural Networks 


FN networks are regression functions of type (7.3) where each neuron $z_j^{(m)}$, $1 \le j \le q_m$,
of the layers $z^{(m)} = (1, z_1^{(m)}, \ldots, z_{q_m}^{(m)})^\top$, $1 \le m \le d$, has the structure of
a GLM; the first component $z_0^{(m)} = 1$ always plays the role of the intercept and does
not need any modeling.

A first important choice is the activation function $\phi : \mathbb{R} \to \mathbb{R}$ which plays the
role of the inverse link function $g^{-1}$. To perform non-linear representation learning,
this activation function should be non-linear, too. The most popular choices of
activation functions are listed in Table 7.1.

The first three examples in Table 7.1 are smooth functions with simple derivatives,
see the last column of Table 7.1. Having simple derivatives is an advantage in
gradient descent algorithms for model fitting. The derivative of the ReLU activation
function for $x \neq 0$ is given by the step function activation, and in 0 one typically
considers a sub-gradient. We briefly comment on these activation functions.


Table 7.1 Popular choices of non-linear activation functions and their derivatives; the last two
examples are not strictly monotone

Activation function                     | Definition                      | Derivative
Sigmoid (logistic) activation           | $\phi(x) = (1 + e^{-x})^{-1}$   | $\phi' = \phi(1 - \phi)$
Hyperbolic tangent activation           | $\phi(x) = \tanh(x)$            | $\phi' = 1 - \phi^2$
Exponential activation                  | $\phi(x) = \exp(x)$             | $\phi' = \phi$
Step function activation                | $\phi(x) = 1_{\{x \ge 0\}}$     |
Rectified linear unit (ReLU) activation | $\phi(x) = x \, 1_{\{x \ge 0\}}$ |


Fig. 7.1 Hyperbolic tangent activation function $x \mapsto \tanh(wx) \in (-1, 1)$ for
(fixed) weights $w \in \{1/5, 1, 5\}$ and $x \in (-10, 10)$


• We are mainly going to use the hyperbolic tangent activation function


$$x \mapsto \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2 \big(1 + e^{-2x}\big)^{-1} - 1 \in (-1, 1).$$


Figure 7.1 illustrates the hyperbolic tangent activation function.

The hyperbolic tangent activation function is anti-symmetric w.r.t. the origin
with range (−1, 1). This anti-symmetry and boundedness is an advantage in
fitting deep FN network architectures. For this reason we usually prefer the
hyperbolic tangent over other activation functions.

• The sigmoid activation function corresponds to the logistic function that was
used in the Bernoulli and the categorical EFs, see Sects. 2.1.2 and 5.7. The
sigmoid activation function can be obtained from the hyperbolic tangent activation
function by setting $\phi(x) = (\tanh(x/2) + 1)/2$.

• The step function activation is not really used in applications. However, it allows
for nice interpretations, and it links FN networks to the theory of regression and
classification trees (CARTs); see Breiman et al. [54] for CARTs.

• The exponential activation function is a nice differentiable choice whenever the
range should be one-sided bounded.

• The ReLU activation function is also called hinge function or ramp function. This
is the preferred choice in the machine learning community. However, typically,
we will not use it because in our experience it is less robust in fitting compared to
the hyperbolic tangent activation function. This may be for two reasons, firstly,
the ReLU activation is unbounded, and secondly, it is identically equal to zero
for x < 0, which implies that there is no sensitivity in negative choices of x.
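
The activation functions discussed above and their derivatives can be written down in a few lines. The sketch below is only illustrative; the function names are hypothetical, and the step and ReLU functions are taken with the convention that the value at 0 belongs to the non-negative branch.

```python
import math


def sigmoid(x):
    """Sigmoid (logistic) activation."""
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    """Derivative of the sigmoid activation: phi * (1 - phi)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    """Derivative of the hyperbolic tangent activation: 1 - phi^2."""
    return 1.0 - math.tanh(x) ** 2


def step(x):
    """Step function activation."""
    return 1.0 if x >= 0.0 else 0.0


def relu(x):
    """Rectified linear unit activation; its derivative for x != 0 is step(x)."""
    return x if x >= 0.0 else 0.0


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.7, 3.0):
        # Sigmoid expressed through the hyperbolic tangent.
        print(x, sigmoid(x), (math.tanh(x / 2.0) + 1.0) / 2.0, d_tanh(x), relu(x))
```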
A FN layer with activation function φ is a mapping

$$z^{(m)} : \{1\} \times \mathbb{R}^{q_{m-1}} \to \{1\} \times \mathbb{R}^{q_m} \qquad (7.5)$$

$$z \mapsto z^{(m)}(z) = \big(1, z_1^{(m)}(z), \ldots, z_{q_m}^{(m)}(z)\big)^\top,$$


having neurons for $1 \le j \le q_m$


$$z_j^{(m)}(z) = \phi\langle w_j^{(m)}, z \rangle = \phi\Big( \sum_{l=0}^{q_{m-1}} w_{l,j}^{(m)} z_l \Big), \qquad (7.6)$$


with given network weights $w_j^{(m)} = (w_{l,j}^{(m)})_{0 \le l \le q_{m-1}}^\top \in \mathbb{R}^{q_{m-1}+1}$.


Interpretation Every neuron $z \mapsto z_j^{(m)}(z)$ describes a GLM regression function
with link function $\phi^{-1}$ and regression parameter $w_j^{(m)} \in \mathbb{R}^{q_{m-1}+1}$ for features
$z \in \{1\} \times \mathbb{R}^{q_{m-1}}$. These GLM regression functions can be interpreted as data
compression, i.e., in each neuron the $q_{m-1}$-dimensional feature z is projected to
a real number $\langle w_j^{(m)}, z \rangle \in \mathbb{R}$ which is then (non-linearly) activated by φ. Since
this leads to a substantial loss of information, we perform this step of data
compression $q_m$ times in FN layer $z^{(m)}$ so that each neuron in $(z_j^{(m)}(z))_{1 \le j \le q_m}$
represents a different projection of input z. Choosing suitable weights $w_j^{(m)}$ will
allow us to extract the crucial feature information from z to receive good explanatory
variables for the regression task at hand.


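A FN network of depth $d \in \mathbb{N}$ is obtained by composing d FN layers
$z^{(1)}, \ldots, z^{(d)}$ to receive the mapping


$$z^{(d:1)} : \{1\} \times \mathbb{R}^{q_0} \to \{1\} \times \mathbb{R}^{q_d} \qquad (7.7)$$

$$x \mapsto z^{(d:1)}(x) = \big(z^{(d)} \circ \cdots \circ z^{(1)}\big)(x).$$


Choosing a strictly monotone and smooth link function g and a regression
parameter $\beta \in \mathbb{R}^{q_d+1}$ we receive the FN network regression function


$$x \mapsto \mu(x) = g^{-1}\langle \beta, z^{(d:1)}(x) \rangle. \qquad (7.8)$$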


This FN network regression function (7.8) has a network parameter
$\vartheta = (w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta)^\top \in \mathbb{R}^{r}$ of dimension

$$r = \sum_{m=1}^{d} q_m (q_{m-1} + 1) + (q_d + 1).$$


In Fig. 7.2 we illustrate a FN network of depth d = 3, FN layers of dimensions
$(q_1, q_2, q_3) = (20, 15, 10)$ and input dimension $q_0 = 40$.¹ This gives us a network
parameter $\vartheta \in \mathbb{R}^{r}$ of dimension r = 1’306. On the left-hand side we have the raw
features $x \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$, these are processed through the three FN layers, where
the black circles illustrate the neurons $z_j^{(m)}$. The third FN layer $z^{(3)}$ has dimension
$q_3 = 10$ providing the learned representation $z^{(3:1)}(x) \in \{1\} \times \mathbb{R}^{q_3}$ of x. This is
used in the final GLM step (7.8) in the green box of Fig. 7.2.


Fig. 7.2 FN network of depth d = 3, with number of neurons $(q_1, q_2, q_3) = (20, 15, 10)$ and
input dimension $q_0 = 40$. This gives us a network parameter $\vartheta \in \mathbb{R}^{r}$ of dimension r = 1’306


¹ Figures 7.2 and 7.9 are similar to Figure 1 in [122], and all FN network plots have been created
with modified versions of the plot functions of the R package neuralnet [144].
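
The parameter count and the forward pass through (7.5)-(7.8) can be sketched with plain lists. This is only a minimal illustration: the function names are hypothetical, the hyperbolic tangent is used as activation, and the log-link (inverse link exp) in the output is an assumption made for the example.

```python
import math


def n_params(q):
    """Dimension r of the network parameter for layer sizes q = [q0, q1, ..., qd]."""
    hidden = sum(q[m] * (q[m - 1] + 1) for m in range(1, len(q)))
    return hidden + (q[-1] + 1)


def fn_layer(z, weights, phi=math.tanh):
    """One FN layer (7.5)-(7.6).

    z       : input vector with leading intercept component 1
    weights : list of q_m weight vectors, each of length q_{m-1} + 1
    Returns the layer output with a new leading intercept component 1.
    """
    return [1.0] + [phi(sum(w_l * z_l for w_l, z_l in zip(w, z))) for w in weights]


def fn_network(x, layer_weights, beta, inv_link=math.exp):
    """FN network regression function (7.8) for raw features x (without intercept)."""
    z = [1.0] + list(x)
    for weights in layer_weights:
        z = fn_layer(z, weights)
    return inv_link(sum(b * z_l for b, z_l in zip(beta, z)))


if __name__ == "__main__":
    print(n_params([40, 20, 15, 10]))   # architecture of Fig. 7.2
```

With all hidden weights set to zero every neuron outputs tanh(0) = 0, so the prediction reduces to the inverse link applied to the intercept of β, which gives a simple check of the implementation.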


Remarks 7.2 


One distinguishes between FN networks of depth d = 1, called shallow 
networks, and FN networks of depth d > 1, called deep networks. In this 
sense, deep learning means that we learn suitable feature representations through 
multiple FN layers d > 1. We come back to this in Sect. 7.2.2, below. Remark 
that some people would only call a network deep if d ≫ 1, here d > 1 will be
chosen for the definition of deep (which is also a precise definition). 

There are two ways of receiving a GLM. If we have a (trivial) FN network of 
depth d = 0, this naturally corresponds to a GLM, see Fig. 7.2. In that case, one 
works with the original features $x \in \mathcal{X}$ in (7.8). The second way of receiving a
GLM is given by choosing the identity function as activation function φ(x) = x.
This implies that $x \mapsto z^{(d:1)}(x) = Ax$ is a linear function for some matrix
$A \in \mathbb{R}^{(q_d+1) \times (q_0+1)}$ and, hence, we receive a GLM.

Under the above interpretation of the representation learning structure (7.7), we 
may also give a different intuition for the FN layers. Typically, we expect that 
the first FN layers decompose feature information x into bits and pieces, which 
are then recomposed in a suitable way for the prediction task. In this sense, we 
typically choose a larger dimension for the early FN layers otherwise we may 
lose too much information already from the very beginning. 

The neural network introduced in (7.7) is called FN network because the signals 
propagate from one layer to the next (directed acyclic graph). If the network 
has loops it is called a recurrent neural (RN) network. RN networks have been 
applied very successfully in image and speech recognition, for instance, long 
short-term memory (LSTM) networks are very useful for time-series analysis. 
We study RN networks in Chap.8, below. A third type of neural networks 
are convolutional neural (CN) networks which are very successfully applied
to image recognition because they are capable of detecting similar structures at
different places in images, i.e., CN networks learn local representations. We will
discuss CN network architectures in Chap. 9, below. 

The generic FN network architecture (7.8) can be complemented by drop- 
out layers, normalization layers, skip connections, embedding layers, etc. Such 
layers are special purpose layers, for instance, taking care of over-fitting. We 
introduce and discuss these below. 

The regression function (7.8) has a one-dimensional output for regression mod- 
eling. Of course, categorical classification can be done completely analogously 
by choosing a link function g suitable for classification, see Sect. 5.7. A similar 
approach also works if, for instance, we want to model simultaneously the mean
and the dispersion of the data with a two-dimensional output function $g^{-1}$.

7.2.2 Universality Theorems


The use of FN networks for representation learning is motivated by the so-called 
universality theorems which say that any compactly supported continuous (regres- 
sion) function can be approximated arbitrarily well by a suitably large FN network. 
As such, we can understand the FN network framework as an approximation tool 
which, of course, is useful far beyond statistical modeling. In Chapter 12 we give 
some proofs of selected universality statements to illustrate the flavor of such results. 
In particular, Cybenko [86], Hornik et al. [192], Hornik [191], Leshno et al. [247],
Park–Sandberg [293, 294], Petrushev [302] and Isenbeck–Rüschendorf [198] have
shown (under mild conditions on the activation function) that shallow FN networks
can approximate any compactly supported continuous function arbitrarily well (in
supremum norm or in $L^2$-norm), if we allow for an arbitrary number of neurons
$q_1 \in \mathbb{N}$ in the single FN layer. Roughly speaking, such a result for shallow FN
networks holds true if and only if the chosen activation function is non-polynomial,
see Leshno et al. [247]. Such results are proved either by algebraic methods of
Stone–Weierstrass type or by Wiener–Tauberian denseness type arguments. Moreover,
approximation results are studied in Barron [25, 26], Yukich et al. [399], Makavoz
[262], Pinkus [303] and Döhler–Rüschendorf [108].

The above stated universality theorems say that shallow FN networks are
sufficient from an approximation point of view. Nevertheless, we will mainly
use deep (multiple layers) FN networks, below. These have better convergence
properties to given function classes because they more easily promote interactions
in feature components compared to shallow ones. Such questions have been studied,
e.g., by Elbrächter et al. [120], Kidger–Lyons [215], Lu et al. [260] or Cheridito et
al. [75]. For instance, Elbrächter et al. [120] compare finite-depth wide networks
to finite-width deep networks (under the choice of the ReLU activation function),
and they conclude that for many function classes deep networks lead to exponential
approximation rates, whereas shallow networks only provide polynomial
approximation rates at the same number of network parameters. This motivates
considering sufficiently deep FN networks for representation learning because these
typically have a better approximation capacity compared to shallow ones.

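We motivate this by two simple examples. For this motivation we use the step
function activation $\phi(x) = \mathbb{1}_{\{x \ge 0\}} \in \{0, 1\}$. If we have the step function activation,
each neuron partitions $\mathbb{R}^{q_{m-1}}$ along a hyperplane, i.e.,

$$
z \mapsto z_j^{(m)}(z) = \phi \left\langle w_j^{(m)}, z \right\rangle
= \mathbb{1}_{\left\{ \sum_{l=1}^{q_{m-1}} w_{l,j}^{(m)} z_l \,\ge\, -w_{0,j}^{(m)} \right\}} \in \{0, 1\}. \tag{7.9}
$$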


For a shallow FN network we can study the question of the maximal complexity
of the resulting partition of the feature space $\mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$ when considering $q_1$
neurons (7.9) in the single FN layer $z^{(1)}$. Zaslavsky [400] proved that $q_1$ hyperplanes
can partition the Euclidean space $\mathbb{R}^{q_0}$ in at most

$$
\sum_{j=0}^{\min\{q_0, q_1\}} \binom{q_1}{j} \quad \text{disjoint sets.} \tag{7.10}
$$


This number (7.10) can be seen as a maximal upper complexity bound for shallow
FN networks with step function activation. It grows exponentially for $q_1 \le q_0$, and
it slows down to a polynomial growth for $q_1 > q_0$. Thus, the complexity of shallow
FN networks grows comparably slowly as the width $q_1$ of the network exceeds $q_0$, and
therefore we often need a huge network to receive a good approximation.

This result (7.10) should be contrasted to Theorem 4 in Montúfar et al. [280] who
give a lower bound on the complexity of regression functions of deep FN networks
(under the ReLU activation function). Assume $q_m \ge q_0$ for all $1 \le m \le d$. The
maximal complexity is bounded below by

$$
\left( \prod_{m=1}^{d-1} \left\lfloor \frac{q_m}{q_0} \right\rfloor^{q_0} \right) \sum_{j=0}^{q_0} \binom{q_d}{j} \quad \text{disjoint linear regions.} \tag{7.11}
$$

If we choose as an example a FN network with fixed width $q_m = 4$ for all $m \ge 1$
and an input of dimension $q_0 = 2$, we receive from (7.11) a lower bound of

$$
4^{d-1} \left( \binom{4}{0} + \binom{4}{1} + \binom{4}{2} \right) = 11 \cdot 4^{d-1}.
$$

Thus, we have an exponential growth in depth $d \to \infty$. This contrasts the
polynomial complexity growth (7.10) of shallow FN networks.
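
The two bounds can be evaluated numerically. The following is a minimal sketch in Python; the function names `zaslavsky_bound` and `montufar_lower_bound` are hypothetical helpers introduced only for this illustration, and the comparison at an equal number of neurons (rather than an equal number of parameters) is a simplifying assumption.

```python
from math import comb, prod


def zaslavsky_bound(q0: int, q1: int) -> int:
    """Upper bound (7.10): maximal number of sets into which q1 hyperplanes
    can partition the Euclidean space of dimension q0."""
    return sum(comb(q1, j) for j in range(min(q0, q1) + 1))


def montufar_lower_bound(q0: int, widths: list) -> int:
    """Lower bound (7.11) for a ReLU FN network with widths q_1, ..., q_d,
    assuming q_m >= q0 for all layers."""
    if any(q < q0 for q in widths):
        raise ValueError("the bound assumes q_m >= q0 for all FN layers")
    *hidden, q_d = widths
    # product over the first d-1 layers of floor(q_m / q0) ** q0
    factor = prod((q // q0) ** q0 for q in hidden)
    return factor * sum(comb(q_d, j) for j in range(q0 + 1))


if __name__ == "__main__":
    q0 = 2
    for d in (1, 2, 4, 8):
        deep = montufar_lower_bound(q0, [4] * d)   # depth d, width 4
        shallow = zaslavsky_bound(q0, 4 * d)       # one layer with 4d neurons
        print(f"d={d}: deep lower bound {deep}, shallow upper bound {shallow}")
```

For width 4 and $q_0 = 2$ the deep bound equals $11 \cdot 4^{d-1}$, whereas the shallow bound with $q_1$ neurons is the quadratic polynomial $1 + q_1 + q_1(q_1-1)/2$.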


Example 7.3 (Shallow vs. Deep Networks: Partitions) We give a second more
explicit example that compares shallow and deep FN networks. Choose $q_0 = 2$
and assume we want to describe a regression function

$$
\mu : \mathbb{R}^2 \to \mathbb{R}, \qquad x \mapsto \mu(x).
$$

If we think of a tool box of basis functions to build regression function $\mu$ we may
want to choose indicator functions $x \mapsto \chi_A(x) \in \{0, 1\}$ for arbitrary rectangles
$A = [x_1^-, x_1^+) \times [x_2^-, x_2^+) \subset \mathbb{R}^2$. We show that we can easily construct such indicator
functions $\chi_A(x)$ for given rectangles $A \subset \mathbb{R}^2$ with FN networks of depth $d = 2$, but
not with shallow FN networks.

For illustrative purposes, we fix a square $A = [-1/2, 1/2) \times [-1/2, 1/2) \subset \mathbb{R}^2$,
and we want to construct $\chi_A(x)$ with a network of depth $d = 2$. This indicator
function $\chi_A$ is illustrated in Fig. 7.3.




Fig. 7.3 Indicator function $\chi_A(x)$ for square $A = [-1/2, 1/2) \times [-1/2, 1/2) \subset \mathbb{R}^2$ (panel title: deep FN of depth d=2)


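We choose the step function activation for $\phi$ and a first FN layer with $q_1 = 4$
neurons

$$
x \mapsto z^{(1)}(x) = \left(1, z_1^{(1)}(x), \ldots, z_4^{(1)}(x)\right)^\top
= \left(1, \mathbb{1}_{\{x_1 \ge -1/2\}}, \mathbb{1}_{\{x_2 \ge -1/2\}}, \mathbb{1}_{\{x_1 \ge 1/2\}}, \mathbb{1}_{\{x_2 \ge 1/2\}}\right)^\top \in \{1\} \times \{0, 1\}^4.
$$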


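This FN layer has a network parameter, see also (7.9),

$$
\left(w_1^{(1)}, \ldots, w_4^{(1)}\right) =
\begin{pmatrix} 1/2 & 1/2 & -1/2 & -1/2 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}, \tag{7.12}
$$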


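having dimension $q_1(q_0 + 1) = 12$. For the second FN layer with $q_2 = 4$ neurons
we choose the step function activation and

$$
z \mapsto z^{(2)}(z) = \left(1, z_1^{(2)}(z), \ldots, z_4^{(2)}(z)\right)^\top
= \left(1, \mathbb{1}_{\{z_1 + z_2 \ge 3/2\}}, \mathbb{1}_{\{z_2 + z_3 \ge 3/2\}}, \mathbb{1}_{\{z_1 + z_4 \ge 3/2\}}, \mathbb{1}_{\{z_3 + z_4 \ge 3/2\}}\right)^\top.
$$

This FN layer has a network parameter

$$
\left(w_1^{(2)}, \ldots, w_4^{(2)}\right) =
\begin{pmatrix}
-3/2 & -3/2 & -3/2 & -3/2 \\
1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix},
$$

having dimension $q_2(q_1 + 1) = 20$. For the output layer we choose the identity link
$g(x) = x$, and the regression parameter $\beta = (0, 1, -1, -1, 1)^\top \in \mathbb{R}^5$. As a result,
we obtain

$$
\chi_A(x) = \left\langle \beta, z^{(2:1)}(x) \right\rangle. \tag{7.13}
$$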


That is, this network of depth d = 2, number of neurons (q1, q2) = (4,4), step 
function activation and identity link can perfectly replicate the indicator function for 
the square A = [—1/2, 1/2) x [—1/2, 1/2), see Fig. 7.3. This network has r = 37 
parameters. 
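
The construction can be checked numerically. The following minimal Python sketch hard-codes the weights of this example (intercepts in the first row of each weight matrix, one column per neuron); the helper names `step`, `fn_layer` and `chi_A` are hypothetical and only serve this illustration.

```python
def step(x: float) -> int:
    """Step function activation 1{x >= 0}."""
    return 1 if x >= 0 else 0


def fn_layer(weights, z):
    """One FN layer: z carries the leading component 1, and every column of
    `weights` holds the intercept (row 0) and the weights of one neuron."""
    n_neurons = len(weights[0])
    out = [1.0]
    for j in range(n_neurons):
        pre_activation = sum(weights[l][j] * z[l] for l in range(len(z)))
        out.append(float(step(pre_activation)))
    return out


# First FN layer, rows: intercept, x1, x2.
W1 = [[0.5, 0.5, -0.5, -0.5],
      [1.0, 0.0, 1.0, 0.0],
      [0.0, 1.0, 0.0, 1.0]]
# Second FN layer, rows: intercept, z1, z2, z3, z4.
W2 = [[-1.5, -1.5, -1.5, -1.5],
      [1.0, 0.0, 1.0, 0.0],
      [1.0, 1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 1.0],
      [0.0, 0.0, 1.0, 1.0]]
BETA = [0.0, 1.0, -1.0, -1.0, 1.0]


def chi_A(x1: float, x2: float) -> float:
    """Depth-2 network with identity link for the square [-1/2, 1/2)^2."""
    z1 = fn_layer(W1, [1.0, x1, x2])
    z2 = fn_layer(W2, z1)
    return sum(b * z for b, z in zip(BETA, z2))


if __name__ == "__main__":
    n_params = sum(len(W) * len(W[0]) for W in (W1, W2)) + len(BETA)
    print("number of parameters:", n_params)
    for point in [(0.0, 0.0), (-0.5, -0.5), (0.5, 0.0), (0.0, 2.0), (2.0, 2.0)]:
        print(point, chi_A(*point))
```

The second layer implements an inclusion-exclusion over the four quadrants anchored at the corners of the square, which is why the signs in $\beta$ alternate.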

We now consider a shallow FN network with $q_1$ neurons. The resulting regression
function with identity link is given by

$$
x \mapsto \left\langle \beta, z^{(1:1)}(x) \right\rangle
= \left\langle \beta, \left(1, \mathbb{1}_{\{\langle w_1^{(1)}, x \rangle \ge 0\}}, \ldots, \mathbb{1}_{\{\langle w_{q_1}^{(1)}, x \rangle \ge 0\}}\right)^\top \right\rangle,
$$

where we have used the step function activation $\phi(x) = \mathbb{1}_{\{x \ge 0\}}$. As in (7.9),
each of these neurons leads to a partition of the space $\mathbb{R}^2$ with a straight line.
Importantly these straight lines go across the entire feature space, and, therefore,
we cannot exactly construct the indicator function of Fig. 7.3 with a shallow
FN network. This can nicely be seen in Fig. 7.4 (lhs), where we consider a shallow
FN network with $q_1 = 4$ neurons, weights (7.12), and
$\beta = (0, 1/2, 1/2, -1/2, -1/2)^\top$.

However, from the universality theorems we know that shallow FN networks
can approximate any compactly supported (continuous) function arbitrarily well
for sufficiently large $q_1$. In this example we can introduce additional neurons and
let the resulting hyperplanes rotate around the origin. In Fig. 7.4 (middle, rhs) we
show this for $q_1 = 8$ and $q_1 = 64$ neurons. We observe that this allows us to
approximate a circle, see Fig. 7.4 (rhs), and having circles of different sizes at
different locations will allow us to approximate the square $A$ considered above.
However, of course, this is a much less efficient way compared to the deep FN
network (7.13).

Fig. 7.4 Shallow FN networks with $q_1 = 4$ (lhs), $q_1 = 8$ (middle) and $q_1 = 64$ (rhs); panel titles: shallow FN network q1=4, q1=8, q1=64; horizontal axis $x_1$

Intuitively speaking, shallow FN networks act like additions where we add more
and more separating hyperplanes for $q_1 \to \infty$ (superposition of basis functions).
In contrast to that, going deep allows us to not only use additions but to also use
multiplications (composition of basis functions). This is the reason why we can
easily construct the indicator function $\chi_A$ in the deep case (where we multiply
zeros along the boundary of $A$), but not in the shallow case. $\blacksquare$


7.2.3 Gradient Descent Methods 


We describe gradient descent methods in this section. These are used to fit FN 
networks. Gradient descent algorithms have already been used in Sect. 6.2.4 for 
fitting LASSO regularized regression models. We will give the full methodological 
part here, without relying on Sect. 6.2.4. 


Plain Vanilla Gradient Descent Algorithm 


Assume we have independent instances $(Y_i, x_i)$, $1 \le i \le n$, that follow the same
member of the EDF. We choose a regression function

$$
x_i \mapsto \mu(x_i) = \mu_\vartheta(x_i) = \mathbb{E}_{\theta(x_i)}[Y_i] = g^{-1} \left\langle \beta, z^{(d:1)}(x_i) \right\rangle,
$$

for a strictly monotone and smooth link function $g$, and a FN network $z^{(d:1)}$ with
network parameter $\vartheta \in \mathbb{R}^r$. We assume that the chosen activation function $\phi$ is
differentiable. We highlight in the notation that the mean functional $\mu_\vartheta(\cdot)$ depends
on the network parameter $\vartheta$. The canonical parameter of the response $Y_i$ is given
by $\theta(x_i) = h(\mu_\vartheta(x_i)) \in \Theta$, where $h = (\kappa')^{-1}$ is the canonical link and $\kappa$ the
cumulant function of the chosen member of the EDF. This gives us (under constant
dispersion $\varphi$) the log-likelihood function, for given data $Y = (Y_1, \ldots, Y_n)^\top$,

$$
\vartheta \mapsto \ell_Y(\vartheta) = \sum_{i=1}^n \frac{v_i}{\varphi} \Big[ Y_i h\big(\mu_\vartheta(x_i)\big) - \kappa\big(h(\mu_\vartheta(x_i))\big) \Big] + a(Y_i; v_i/\varphi).
$$
The deviance loss function in this model is given by, see (4.9) and (4.8),

$$
\mathfrak{D}(Y, \vartheta) = \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi} \Big( Y_i h(Y_i) - \kappa\big(h(Y_i)\big) - Y_i h\big(\mu_\vartheta(x_i)\big) + \kappa\big(h(\mu_\vartheta(x_i))\big) \Big) \ge 0. \tag{7.14}
$$

The MLE of $\vartheta$ is found by either maximizing the log-likelihood function or by
minimizing the deviance loss function in $\vartheta$. This problem cannot be solved in
general because of complexity. Typically, the deviance loss function is non-convex
in $\vartheta$ and it may have many local minimums. This is one of the reasons why we
are less ambitious here, and why we just try to find a network parameter $\vartheta$ which
provides a “small” deviance loss $\mathfrak{D}(Y, \vartheta)$ for the given data $Y$. We discuss this
further, below; in fact, this is a crucial point in FN network fitting that is related to
in-sample over-fitting and, therefore, this point will require a broader discussion.

For the moment, we just try to find a network parameter $\vartheta$ that provides a
small deviance loss $\mathfrak{D}(Y, \vartheta)$ for the given data $Y$. Gradient descent algorithms
suggest that we try to step-wise locally improve our current position by changing the
network parameter into the direction of the maximal local decrease of the deviance
loss function. By assumption, our deviance loss function is differentiable in $\vartheta$. This
allows us to consider the following first order Taylor expansion in $\vartheta$

$$
\mathfrak{D}(Y, \widetilde{\vartheta}) = \mathfrak{D}(Y, \vartheta) + \nabla_\vartheta \mathfrak{D}(Y, \vartheta)^\top \big(\widetilde{\vartheta} - \vartheta\big) + o\big(\|\widetilde{\vartheta} - \vartheta\|_2\big) \qquad \text{as } \|\widetilde{\vartheta} - \vartheta\|_2 \to 0.
$$

This shows that the locally optimal change $\vartheta \mapsto \widetilde{\vartheta}$ points into the opposite direction
of the gradient of the deviance loss function. This motivates the following gradient
descent step.


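Assume that at algorithmic time $t \in \mathbb{N}$ we have a network parameter $\vartheta^{(t)} \in
\mathbb{R}^r$. Choose a suitable learning rate $\varrho_{t+1} > 0$, and consider the gradient
descent update

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}). \tag{7.15}
$$

This gradient descent update gives us the new (smaller) deviance loss at
algorithmic time $t + 1$

$$
\mathfrak{D}(Y, \vartheta^{(t+1)}) = \mathfrak{D}(Y, \vartheta^{(t)}) - \varrho_{t+1} \left\| \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \right\|_2^2 + o(\varrho_{t+1}) \qquad \text{for } \varrho_{t+1} \downarrow 0.
$$

Under suitably tempered learning rates $(\varrho_t)_{t \ge 1}$, this algorithm will converge to a
local minimum of the deviance loss function as $t \to \infty$ (provided that we do not
get trapped in a saddlepoint).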
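
The update (7.15) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the loss is a hypothetical two-dimensional quadratic toy function (not a deviance loss of a FN network), the learning rate is kept constant, and the names `gradient_descent`, `toy_loss` and `toy_grad` are introduced only for this example.

```python
def gradient_descent(loss, grad, theta0, learning_rate, n_steps):
    """Iterate theta -> theta - learning_rate * grad(theta), as in (7.15),
    and record the loss after every step."""
    theta = list(theta0)
    losses = [loss(theta)]
    for _ in range(n_steps):
        g = grad(theta)
        theta = [th - learning_rate * gj for th, gj in zip(theta, g)]
        losses.append(loss(theta))
    return theta, losses


def toy_loss(theta):
    """Hypothetical smooth loss with unique minimum in (1, -1/2)."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2


def toy_grad(theta):
    """Gradient of toy_loss."""
    return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)]


if __name__ == "__main__":
    theta, losses = gradient_descent(toy_loss, toy_grad, [0.0, 0.0], 0.1, 50)
    print("final parameter:", theta)
    print("first losses:", losses[:4])
```

With a sufficiently small learning rate the recorded losses decrease in every step; a too large learning rate invalidates the first order Taylor approximation and the losses may increase.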


Remarks 7.4 We give a couple of (preliminary) remarks on the gradient descent
algorithm (7.15); more explanation, further derivations, and variants of the gradient
descent algorithm will be discussed below.

• In the applications we will early stop the gradient descent algorithm before
reaching a local minimum (to prevent over-fitting). This is going to be
discussed in the next paragraphs.

• Fine-tuning the learning rates $(\varrho_t)_t$ is important, in particular, there is a trade-off
between smaller and bigger learning rates: they need to be sufficiently small so
that the first order Taylor expansion is still a valid approximation, and they should
be sufficiently big because otherwise the convergence of the algorithm will be very
slow, as it needs many iterations.

• The gradient descent algorithm is a first order algorithm, and one is tempted to
study higher order approximations, e.g., leading to the Newton–Raphson algorithm.
Unfortunately, higher order derivatives are computationally not feasible if
the size $n$ of the data $Y = (Y_1, \ldots, Y_n)^\top$ and the dimension $r$ of the network
parameter $\vartheta$ are large. In fact, even the calculation of the first order derivatives
may be challenging and, therefore, stochastic gradient descent methods are
considered below. Nevertheless, it is beneficial to have a notion of a second order
term. Momentum-based methods originate from approximating the second order
terms, these will be studied in (7.19)-(7.20), below.

• The gradient descent step (7.15) solves an unconstrained local optimization.
Similarly to (6.15)-(6.16) we could change the gradient descent algorithm to
a constrained optimization problem, e.g., involving a LASSO constraint that can
be solved with the generalized projection operator (6.17).


Gradient Calculation via Back-Propagation 


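Fast gradient descent algorithms essentially rely on fast gradient calculations of the
deviance loss function. Under the EDF setup we have gradient w.r.t. $\vartheta$

$$
\begin{aligned}
\nabla_\vartheta \mathfrak{D}(Y, \vartheta)
&= \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi} \big( \mu_\vartheta(x_i) - Y_i \big) h'\big(\mu_\vartheta(x_i)\big) \nabla_\vartheta \mu_\vartheta(x_i) \\
&= \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi} \frac{\mu_\vartheta(x_i) - Y_i}{V(\mu_\vartheta(x_i))} \frac{1}{g'(\mu_\vartheta(x_i))} \nabla_\vartheta \left\langle \beta, z^{(d:1)}(x_i) \right\rangle,
\end{aligned} \tag{7.16}
$$

where the last step uses the variance function $V(\cdot)$ of the chosen EDF, we also refer
to (5.9). The main difficulty is the calculation of the gradient

$$
\nabla_\vartheta \left\langle \beta, z^{(d:1)}(x) \right\rangle = \nabla_\vartheta \left\langle \beta, \big(z^{(d)} \circ \cdots \circ z^{(1)}\big)(x) \right\rangle,
$$

w.r.t. the network parameter $\vartheta = (w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta) \in \mathbb{R}^r$, and where each
FN layer $z^{(m)}$ involves the weights $W^{(m)} = (w_1^{(m)}, \ldots, w_{q_m}^{(m)}) \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$.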


The workhorse for these gradient calculations is the back-propagation method
of Rumelhart et al. [324]. Basically, the back-propagation method is a clever
reparametrization of the problem so that the gradients can be calculated more easily.
We therefore modify the weight matrices $W^{(m)}$ by dropping the first row containing
the intercept parameters $w_{0,j}^{(m)}$, $1 \le j \le q_m$. Define for $1 \le m \le d+1$

$$
W_{(-0)}^{(m)} = \left( w_{j_{m-1}, j_m}^{(m)} \right)_{1 \le j_{m-1} \le q_{m-1};\ 1 \le j_m \le q_m} \in \mathbb{R}^{q_{m-1} \times q_m},
$$

where $w_{j_{m-1}, j_m}^{(m)}$ denotes component $j_{m-1}$ of $w_{j_m}^{(m)}$, and where we set $q_{d+1} = 1$
(output dimension) and $w_{j_d, 1}^{(d+1)} = \beta_{j_d}$ for $0 \le j_d \le q_d$.


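Proposition 7.5 (Back-Propagation for the Hyperbolic Tangent Activation)
Choose a FN network of depth $d \in \mathbb{N}$ and with hyperbolic tangent activation
function $\phi(x) = \tanh(x)$.

• Define recursively

  – initialize $q_{d+1} = 1$ and $\delta^{(d+1)}(x) = 1 \in \mathbb{R}^{q_{d+1}}$;
  – iterate for $d \ge m \ge 1$

$$
\delta^{(m)}(x) = \mathrm{diag}\left( 1 - \big(z_{j_m}^{(m:1)}(x)\big)^2 \right)_{1 \le j_m \le q_m} W_{(-0)}^{(m+1)} \, \delta^{(m+1)}(x) \in \mathbb{R}^{q_m}.
$$

• We obtain for $0 \le m \le d$

$$
\left( \frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial w_{j_m, j_{m+1}}^{(m+1)}} \right)_{0 \le j_m \le q_m;\ 1 \le j_{m+1} \le q_{m+1}}
= z^{(m:1)}(x) \, \delta^{(m+1)}(x)^\top \in \mathbb{R}^{(q_m+1) \times q_{m+1}},
$$

where $z^{(0:1)}(x) = x \in \mathbb{R}^{q_0+1}$ and $w_1^{(d+1)} = \beta \in \mathbb{R}^{q_d+1}$.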


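Proof of Proposition 7.5 Choose $1 \le m \le d$ and define for the neurons $1 \le j_m \le
q_m$ the variables

$$
\zeta_{j_m}^{(m)}(x) = \left\langle w_{j_m}^{(m)}, z^{(m-1:1)}(x) \right\rangle.
$$

The learned representation in the $m$-th FN layer is obtained by activating these
variables

$$
z^{(m:1)}(x) = \left( 1, \phi\big(\zeta_1^{(m)}(x)\big), \ldots, \phi\big(\zeta_{q_m}^{(m)}(x)\big) \right)^\top \in \mathbb{R}^{q_m+1}.
$$

For the output we define

$$
\zeta_1^{(d+1)}(x) = \left\langle \beta, z^{(d:1)}(x) \right\rangle.
$$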




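The main idea is to calculate the derivatives of $\langle \beta, z^{(d:1)}(x) \rangle$ w.r.t. these new
variables $\zeta_j^{(m)}(x)$.

Initialization for $m = d+1$ This provides for $m = d+1$ and $1 \le j_{d+1} \le q_{d+1} = 1$

$$
\frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_1^{(d+1)}(x)} = 1 = \delta_1^{(d+1)}(x).
$$

Recursion for $m < d+1$ Next, we calculate the derivatives w.r.t. $\zeta_{j_d}^{(d)}(x)$, for $m = d$
and $1 \le j_d \le q_d$. They are given by (note $q_{d+1} = 1$)

$$
\begin{aligned}
\frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_{j_d}^{(d)}(x)}
&= \frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_1^{(d+1)}(x)} \, \frac{\partial \zeta_1^{(d+1)}(x)}{\partial \zeta_{j_d}^{(d)}(x)}
= \delta_1^{(d+1)}(x) \, \beta_{j_d} \, \phi'\big(\zeta_{j_d}^{(d)}(x)\big) \qquad (7.17) \\
&= \delta_1^{(d+1)}(x) \, w_{j_d, 1}^{(d+1)} \left( 1 - \big(z_{j_d}^{(d:1)}(x)\big)^2 \right) = \delta_{j_d}^{(d)}(x),
\end{aligned}
$$

where we have used $w_{j_d, 1}^{(d+1)} = \beta_{j_d}$ and for the hyperbolic tangent activation function
$\phi' = 1 - \phi^2$. Continuing recursively for $d > m \ge 1$ and $1 \le j_m \le q_m$ we obtain

$$
\begin{aligned}
\frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_{j_m}^{(m)}(x)}
&= \sum_{j_{m+1}=1}^{q_{m+1}} \frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_{j_{m+1}}^{(m+1)}(x)} \, \frac{\partial \zeta_{j_{m+1}}^{(m+1)}(x)}{\partial \zeta_{j_m}^{(m)}(x)} \\
&= \sum_{j_{m+1}=1}^{q_{m+1}} \delta_{j_{m+1}}^{(m+1)}(x) \, w_{j_m, j_{m+1}}^{(m+1)} \left( 1 - \big(z_{j_m}^{(m:1)}(x)\big)^2 \right) = \delta_{j_m}^{(m)}(x).
\end{aligned}
$$

Thus, the vectors $\delta^{(m)}(x) = (\delta_1^{(m)}(x), \ldots, \delta_{q_m}^{(m)}(x))^\top$ are calculated recursively in
$d \ge m \ge 1$ with initialization $\delta^{(d+1)}(x) = 1$ and the recursion

$$
\delta^{(m)}(x) = \mathrm{diag}\left( 1 - \big(z_{j_m}^{(m:1)}(x)\big)^2 \right)_{1 \le j_m \le q_m} W_{(-0)}^{(m+1)} \, \delta^{(m+1)}(x) \in \mathbb{R}^{q_m}.
$$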

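Finally, we need to show how these derivatives are related to the original
derivatives in the gradient descent method. We have for $0 \le j_d \le q_d$ and $j_{d+1} = 1$

$$
\frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \beta_{j_d}}
= \frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_1^{(d+1)}(x)} \, \frac{\partial \zeta_1^{(d+1)}(x)}{\partial \beta_{j_d}}
= \delta_{j_{d+1}}^{(d+1)}(x) \, z_{j_d}^{(d:1)}(x).
$$

For $1 \le m < d$, and $0 \le j_m \le q_m$ and $1 \le j_{m+1} \le q_{m+1}$ we have

$$
\frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial w_{j_m, j_{m+1}}^{(m+1)}}
= \frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_{j_{m+1}}^{(m+1)}(x)} \, \frac{\partial \zeta_{j_{m+1}}^{(m+1)}(x)}{\partial w_{j_m, j_{m+1}}^{(m+1)}}
= \delta_{j_{m+1}}^{(m+1)}(x) \, z_{j_m}^{(m:1)}(x).
$$

For $m = 0$, and $0 \le l \le q_0$ and $1 \le j_1 \le q_1$ we have

$$
\frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial w_{l, j_1}^{(1)}}
= \frac{\partial \left\langle \beta, z^{(d:1)}(x) \right\rangle}{\partial \zeta_{j_1}^{(1)}(x)} \, \frac{\partial \zeta_{j_1}^{(1)}(x)}{\partial w_{l, j_1}^{(1)}}
= \delta_{j_1}^{(1)}(x) \, x_l.
$$

This completes the proof of Proposition 7.5. $\square$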


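Remark 7.6 Proposition 7.5 gives the back-propagation method for the hyperbolic
tangent activation function which has derivative $\phi' = 1 - \phi^2$. This becomes visible
in the definition of $\delta^{(m)}(x)$ where we consider the diagonal matrix

$$
\mathrm{diag}\left( 1 - \big(z_{j_m}^{(m:1)}(x)\big)^2 \right)_{1 \le j_m \le q_m}.
$$

For a general differentiable activation function $\phi$ this needs to be replaced by,
see (7.17),

$$
\mathrm{diag}\left( \phi' \left\langle w_{j_m}^{(m)}, z^{(m-1:1)}(x) \right\rangle \right)_{1 \le j_m \le q_m}.
$$

In the case of the sigmoid activation function this gives us, see also Table 7.1,

$$
\mathrm{diag}\left( z_{j_m}^{(m:1)}(x) \big( 1 - z_{j_m}^{(m:1)}(x) \big) \right)_{1 \le j_m \le q_m}.
$$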

Plain vanilla gradient descent algorithm for FN networks

1. Choose an initial network parameter $\vartheta^{(0)} \in \mathbb{R}^r$.
2. Iterate for $t \ge 0$ until a stopping criterion is met:

(a) Calculate the gradient $\nabla_\vartheta \mathfrak{D}(Y, \vartheta)$ in network parameter $\vartheta = \vartheta^{(t)}$
using (7.16) and the back-propagation method of Proposition 7.5 (for the
hyperbolic tangent activation function).

(b) Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}).
$$

Remark 7.7 The initialization $\vartheta^{(0)} \in \mathbb{R}^r$ of the gradient descent algorithm needs
some care. A FN network has many symmetries, for instance, we can permute
neurons within a FN layer and we receive the same predictive model. For this
reason, the initial network weights $W^{(m)} = (w_1^{(m)}, \ldots, w_{q_m}^{(m)}) \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$,
$1 \le m \le d$, should not be chosen with identical components because this will
result in a saddlepoint of the corresponding objective function, and gradient descent
will not work. For this reason, these weights are initialized randomly either using a
uniform or a Gaussian distribution. The former is related to the glorot_uniform
initializer in keras,² see (16) in Glorot–Bengio [160]. This initializer scales the
support of the uniform distribution with the sizes of the FN layers that are connected
by the corresponding weights $w_{l,j}^{(m)}$.

For the output parameter we usually set as initial value $\beta^{(0)} = (\widehat{\beta}_0, 0, \ldots, 0)^\top \in
\mathbb{R}^{q_d+1}$ where $\widehat{\beta}_0$ is the MLE in the corresponding null model (not considering any
features) and transformed to the chosen link $g$. This choice implies that the gradient
descent algorithm starts in the null model, and any decrease in deviance loss can be
seen as an improved in-sample loss of using the FN network regression structure
over the null model.
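
A minimal Python sketch of such an initialization is given next. It assumes the Glorot uniform rule with support $[-c, c]$, $c = \sqrt{6/(q_{m-1} + q_m)}$, for the non-intercept weights and zero intercepts (this mirrors common defaults and is an assumption about how they map to the notation used here); the function names and the argument `null_mean` are hypothetical.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix uniformly on [-c, c] with
    c = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def initialize_network(q, null_mean, link, seed=0):
    """q = [q0, q1, ..., qd] are the layer dimensions; returns the weight
    matrices W^(1), ..., W^(d) (intercepts in row 0) and the initial beta."""
    rng = random.Random(seed)
    weights = []
    for m in range(1, len(q)):
        W = glorot_uniform(q[m - 1], q[m], rng)
        W.insert(0, [0.0] * q[m])          # intercept row initialized in zero
        weights.append(W)
    # start in the null model: only the output intercept is non-zero
    beta = [link(null_mean)] + [0.0] * q[-1]
    return weights, beta


if __name__ == "__main__":
    # hypothetical example: log-link and an empirical mean of 0.1
    weights, beta = initialize_network([3, 5, 2], null_mean=0.1, link=math.log)
    print("beta(0):", beta)
    print("first row of W(1) below the intercepts:", weights[0][1])
```

Because all components of the initial $\beta$ except the intercept are zero, the initial prediction equals `null_mean` for every feature value, regardless of the random hidden weights.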


Stochastic Gradient Descent 


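The gradient in (7.16) has two parts. We have a vector

$$
\mathbf{v}(Y) = \left( \frac{v_i}{\varphi} \, \frac{\mu_\vartheta(x_i) - Y_i}{V(\mu_\vartheta(x_i))} \, \frac{1}{g'(\mu_\vartheta(x_i))} \right)_{1 \le i \le n}^\top \in \mathbb{R}^n,
$$

and we have a matrix

$$
M = \left( \nabla_\vartheta \left\langle \beta, z^{(d:1)}(x_1) \right\rangle, \ldots, \nabla_\vartheta \left\langle \beta, z^{(d:1)}(x_n) \right\rangle \right) \in \mathbb{R}^{r \times n}.
$$

The gradient of the deviance loss function is obtained by the matrix multiplication

$$
\nabla_\vartheta \mathfrak{D}(Y, \vartheta) = \frac{2}{n} \, M \, \mathbf{v}(Y).
$$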
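
As a hedged illustration of this decomposition, the following Python sketch considers the simplest case of a network of depth $d = 0$ (a GLM) under a Poisson model with log-link and dispersion $\varphi = 1$. In this assumed special case $V(\mu) = \mu$ and $g'(\mu) = 1/\mu$, so the $i$-th entry of $\mathbf{v}(Y)$ is $v_i(\mu_i - Y_i)$ and the $i$-th column of $M$ is $x_i$. The function names are hypothetical.

```python
import math


def poisson_mean(beta, x):
    """Mean under the log-link: exp<beta, x>, with x carrying a leading 1."""
    return math.exp(sum(b * xj for b, xj in zip(beta, x)))


def poisson_deviance(beta, X, Y, v):
    """Deviance loss (7.14) with h = log, kappa = exp and phi = 1."""
    total = 0.0
    for x_i, y_i, v_i in zip(X, Y, v):
        mu = poisson_mean(beta, x_i)
        y_log_y = y_i * math.log(y_i) if y_i > 0 else 0.0
        total += v_i * (y_log_y - y_i - y_i * math.log(mu) + mu)
    return 2.0 * total / len(Y)


def poisson_gradient(beta, X, Y, v):
    """Gradient (2/n) M v(Y): column i of M is x_i, entry i of v(Y) is
    v_i * (mu_i - Y_i) in this Poisson log-link special case."""
    grad = [0.0] * len(beta)
    for x_i, y_i, v_i in zip(X, Y, v):
        weight = v_i * (poisson_mean(beta, x_i) - y_i)
        for j, xj in enumerate(x_i):
            grad[j] += weight * xj
    return [2.0 * gj / len(Y) for gj in grad]


if __name__ == "__main__":
    X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # toy features
    Y = [0.0, 1.0, 0.5, 2.0]                                # toy frequencies
    v = [1.0, 2.0, 1.0, 0.5]                                # toy exposures
    print(poisson_gradient([-0.3, 0.2], X, Y, v))
```

For a genuine FN network the columns of $M$ would come from back-propagation instead of being equal to the features.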


Matrix multiplication can be very slow in numerical implementations if the
sample size $n$ is large. For this reason, one typically uses the stochastic gradient
descent (SGD) method that does not consider the entire data $Y = (Y_1, \ldots, Y_n)^\top$
simultaneously.

² For our examples we use the R library keras [77] which is an API to TensorFlow [2].

For the SGD method one chooses a fixed batch size $b \in \mathbb{N}$, and one randomly
partitions the entire data $Y$ into (mini-)batches $Y_1, \ldots, Y_{\lfloor n/b \rfloor}$ of approximately the
same size $b$ (up to cardinality). Each gradient descent update

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}(Y_s, \vartheta^{(t)}),
$$

is then only based on the observations $Y_s$ in the corresponding batch $1 \le s \le \lfloor n/b \rfloor$.
Typically, one sequentially visits all batches, and screening each batch once is called
an epoch. Thus, if we run the SGD algorithm over $K$ epochs on batches of size
$b \le n$, then we perform $K \lfloor n/b \rfloor$ gradient descent steps.
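
The batching scheme can be sketched as follows. This minimal Python example only builds the random partition and counts the resulting updates; the function names are hypothetical, and distributing the remainder of $n/b$ over the batches is one possible reading of "approximately the same size".

```python
import random


def make_batches(n, b, rng):
    """Randomly partition the indices 0, ..., n-1 into floor(n/b) batches
    of approximately size b (the remainder is spread over the batches)."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_batches = max(n // b, 1)
    return [indices[s::n_batches] for s in range(n_batches)]


def count_sgd_steps(n, b, epochs, seed=0):
    """Number of gradient descent updates when every batch is visited once
    per epoch; the update itself is omitted in this sketch."""
    rng = random.Random(seed)
    steps = 0
    for _ in range(epochs):
        for _batch in make_batches(n, b, rng):
            steps += 1   # here one would apply the update on this batch
    return steps


if __name__ == "__main__":
    rng = random.Random(1)
    print(make_batches(10, 3, rng))
    print(count_sgd_steps(n=10, b=3, epochs=2))   # K * floor(n/b) updates
```

Re-drawing the partition in every epoch, as done here, is a design choice of this sketch; keeping one fixed partition over all epochs would equally match the description above.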

Choosing batches of size b reduces the complexity of the matrix multiplication 
from n to b, and, henceforth, leads to much faster run times in one gradient 
descent step. On the other hand, batches should have a minimal size so that the 
gradient descent updates are not too erratic, i.e., if the batches are too small, the 
randomness in the data may point too often into a (completely) wrong direction for 
the optimal gradient descent step. For this reason, optimal batch sizes should be 
chosen carefully. For instance, if we study a low frequency claims count problem, 
say, with an expected frequency of $\lambda = 10\%$, we can determine confidence bounds
for parameter estimation. This will provide an estimate of a minimal batch size b 
for a reliable parameter estimate. 

To have a few erratic steps in SGD, however, can also be beneficial, as long 
as there are not too many of those. Sometimes, the algorithm gets trapped in 
saddlepoints or in flat areas of the objective function (vanishing gradient problem). 
If this is the case, an erratic step may be beneficial because it may perturb the 
algorithm out of its bottleneck. In fact, often SGD has a better performance than the 
plain vanilla gradient descent algorithm that is based on the entire data Y because 
of these noisy contributions. 


Momentum-Based Gradient Descent Methods 


The gradient descent method only considers a first order Taylor expansion and one is
tempted to consider higher order terms to improve the approximation. For instance,
Newton’s method uses a second order Taylor term by updating

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \left( \nabla_\vartheta^2 \mathfrak{D}(Y, \vartheta^{(t)}) \right)^{-1} \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}). \tag{7.18}
$$

In many practical applications this calculation is not feasible as the Hessian
$\nabla_\vartheta^2 \mathfrak{D}(Y, \vartheta^{(t)})$ cannot be calculated in a reasonable amount of time. Another
(simple) way of considering the changes in the gradients is the momentum-based
gradient descent method of Rumelhart et al. [324]. This is inspired by mechanics in
physics and it is achieved by considering the gradients over several iterations of the
algorithm (with exponentially decaying weights). Choose a momentum coefficient
$\nu \in [0, 1)$ and define the initial speed $\mathbf{v}^{(0)} = 0 \in \mathbb{R}^r$.




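Replace the gradient descent update (7.15) by

$$
\mathbf{v}^{(t)} \mapsto \mathbf{v}^{(t+1)} = \nu \mathbf{v}^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}), \tag{7.19}
$$

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} + \mathbf{v}^{(t+1)}. \tag{7.20}
$$

For $\nu = 0$ we have the plain vanilla gradient descent method, for $\nu > 0$ we also
memorize the previous gradients (with exponentially decaying weights). Typically
this leads to better convergence properties.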
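
The two updates (7.19)-(7.20) translate directly into code. The following Python sketch is a minimal illustration; the function name `momentum_step` and the one-dimensional toy gradient are assumptions made for this example.

```python
def momentum_step(theta, velocity, grad, learning_rate, nu):
    """One momentum-based update: first the speed (7.19), then the
    network parameter (7.20)."""
    g = grad(theta)
    new_velocity = [nu * v - learning_rate * gj for v, gj in zip(velocity, g)]
    new_theta = [th + v for th, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


if __name__ == "__main__":
    def toy_grad(theta):
        """Gradient of the hypothetical loss theta^2."""
        return [2.0 * theta[0]]

    theta, velocity = [1.0], [0.0]       # initial speed is zero
    for t in range(5):
        theta, velocity = momentum_step(theta, velocity, toy_grad, 0.1, 0.9)
        print(t + 1, theta, velocity)
```

Setting `nu=0.0` reproduces the plain vanilla gradient descent step, because the speed then only contains the current (scaled) gradient.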

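Nesterov [284] has noticed that for convex functions the gradient descent updates
may have a zig-zag behavior. Therefore, he proposed the so-called
Nesterov-accelerated version

$$
\begin{aligned}
\mathbf{v}^{(t)} &\mapsto \mathbf{v}^{(t+1)} = \nu \mathbf{v}^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \vartheta^{(t)} + \nu \mathbf{v}^{(t)}\big), \\
\vartheta^{(t)} &\mapsto \vartheta^{(t+1)} = \vartheta^{(t)} + \mathbf{v}^{(t+1)}.
\end{aligned} \tag{7.21}
$$

Thus, the calculation of the momentum $\mathbf{v}^{(t+1)}$ uses a look-ahead $\vartheta^{(t)} + \nu \mathbf{v}^{(t)}$ in
the gradient calculation (anticipating part of the next step). This provides for the
update (7.21) the following equivalent versions, under reparametrization
$\widetilde{\vartheta}^{(t)} = \vartheta^{(t)} + \nu \mathbf{v}^{(t)}$,

$$
\begin{aligned}
\vartheta^{(t+1)} &= \vartheta^{(t)} + \Big( \nu \mathbf{v}^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \vartheta^{(t)} + \nu \mathbf{v}^{(t)}\big) \Big) \\
&= \vartheta^{(t)} + \Big( \nu \mathbf{v}^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big) \Big) \\
&= \widetilde{\vartheta}^{(t)} + \Big( \nu \mathbf{v}^{(t+1)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big) \Big) - \nu \mathbf{v}^{(t+1)}.
\end{aligned} \tag{7.22}
$$

For the Nesterov-accelerated update we can also study, using the last line of (7.22),

$$
\begin{aligned}
\mathbf{v}^{(t)} &\mapsto \mathbf{v}^{(t+1)} = \nu \mathbf{v}^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big), \\
\widetilde{\vartheta}^{(t)} &\mapsto \widetilde{\vartheta}^{(t+1)} = \widetilde{\vartheta}^{(t)} + \Big( \nu \mathbf{v}^{(t+1)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big) \Big).
\end{aligned} \tag{7.23}
$$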


Compared to (7.19)-(7.20), we just shift the index by 1 in the momentum $\mathbf{v}^{(t)}$ in
the round brackets of (7.23). The typical way how the Nesterov-acceleration is
formulated is, yet, another equivalent formulation, namely, only in terms of $\vartheta^{(t)}$ and
$\widetilde{\vartheta}^{(t)}$. From the second line of (7.22) and (7.21) we have the updates

$$
\begin{aligned}
\vartheta^{(t+1)} &= \widetilde{\vartheta}^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}\big(Y, \widetilde{\vartheta}^{(t)}\big), \\
\widetilde{\vartheta}^{(t+1)} &= \vartheta^{(t+1)} + \nu \big( \vartheta^{(t+1)} - \vartheta^{(t)} \big).
\end{aligned} \tag{7.24}
$$

Typically, one chooses the momentum coefficient $\nu$ in (7.24) time-dependent by
setting $\nu = t/(t + 3)$.
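
The equivalence of the formulations (7.21) and (7.24) can be verified numerically. The Python sketch below assumes a constant momentum coefficient and a constant learning rate, and uses a hypothetical quadratic toy gradient; both function names are introduced only for this illustration.

```python
def nesterov_velocity_form(grad, theta0, learning_rate, nu, n_steps):
    """Nesterov acceleration in terms of the speed and the parameter,
    as in (7.21), with look-ahead theta + nu * v."""
    theta, v = list(theta0), [0.0] * len(theta0)
    path = [theta]
    for _ in range(n_steps):
        look_ahead = [th + nu * vj for th, vj in zip(theta, v)]
        g = grad(look_ahead)
        v = [nu * vj - learning_rate * gj for vj, gj in zip(v, g)]
        theta = [th + vj for th, vj in zip(theta, v)]
        path.append(theta)
    return path


def nesterov_two_sequence_form(grad, theta0, learning_rate, nu, n_steps):
    """Nesterov acceleration in terms of theta and theta-tilde, as in (7.24)."""
    theta, theta_tilde = list(theta0), list(theta0)   # zero initial speed
    path = [theta]
    for _ in range(n_steps):
        g = grad(theta_tilde)
        new_theta = [tt - learning_rate * gj for tt, gj in zip(theta_tilde, g)]
        theta_tilde = [nt + nu * (nt - th) for nt, th in zip(new_theta, theta)]
        theta = new_theta
        path.append(theta)
    return path


if __name__ == "__main__":
    def toy_grad(theta):
        """Gradient of a hypothetical quadratic loss."""
        return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)]

    a = nesterov_velocity_form(toy_grad, [0.0, 0.0], 0.05, 0.9, 30)
    b = nesterov_two_sequence_form(toy_grad, [0.0, 0.0], 0.05, 0.9, 30)
    print(a[-1], b[-1])
```

Both functions return the same parameter path up to floating point error, since $\vartheta^{(t+1)} - \vartheta^{(t)} = \mathbf{v}^{(t+1)}$.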

In our applications we will use the R interface to the keras library [77]. 
This library has a couple of standard momentum-based gradient descent methods 
implemented which use pre-defined learning rates and momentum coefficients. In 
our analysis we are mainly relying on the variants rmsprop and the Nesterov- 
accelerated version of adam, called nadam. Therefore, we briefly describe these 
three variants, and for more information we refer to Sections 8.3 and 8.5 in 
Goodfellow et al. [166]. 


Predefined Gradient Descent Methods 

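• rmsprop stands for ‘root mean square propagation’, and its origin can be
found in a lecture of Hinton et al. [187]. Denote by $\odot$ the Hadamard product
that computes the component-wise products of two matrices. Choose a weight
$\alpha \in (0, 1)$ and calculate the accumulated squared gradients, set $r^{(0)} = 0 \in \mathbb{R}^r$,

$$
r^{(t)} \mapsto r^{(t+1)} = \alpha r^{(t)} + (1 - \alpha) \left( \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \odot \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \right) \in \mathbb{R}^r.
$$

The sequence $(r^{(t)})_{t \ge 1}$ memorizes the (squared) magnitudes of the components
of the gradients $\nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})$, $t \ge 1$. This is done individually for each
component because we may have directional differences in magnitudes (and
momentum). In contrast to (7.19), $r^{(t)}$ does not model the speed, but rather an
inverse weight. This then motivates the gradient descent update

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \frac{\varrho}{\sqrt{\varepsilon + r^{(t+1)}}} \odot \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}),
$$

where the square-root is taken component-wise, for a global decay rate $\varrho > 0$,
and for a small positive constant $\varepsilon > 0$ to ensure that everything is well-defined.

• adam stands for ‘adaptive moment’ estimation, and it has been proposed by
Kingma–Ba [216]. The momentum is determined by the first two moments in
adam, namely, we set $\mathbf{v}^{(0)} = r^{(0)} = 0 \in \mathbb{R}^r$ and we consider

$$
\mathbf{v}^{(t)} \mapsto \mathbf{v}^{(t+1)} = \nu \mathbf{v}^{(t)} + (1 - \nu) \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}), \tag{7.25}
$$

$$
r^{(t)} \mapsto r^{(t+1)} = \alpha r^{(t)} + (1 - \alpha) \left( \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \odot \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)}) \right), \tag{7.26}
$$

for given weights $\nu, \alpha \in (0, 1)$. Similar to Bayesian credibility theory, $\mathbf{v}^{(t)}$
and $r^{(t)}$ are biased because these two processes have been initialized in zero.
Therefore, they are rescaled by $1/(1 - \nu^t)$ and $1/(1 - \alpha^t)$, respectively. This
gives us the gradient descent update

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \frac{\varrho}{\varepsilon + \sqrt{\dfrac{r^{(t+1)}}{1 - \alpha^{t+1}}}} \odot \frac{\mathbf{v}^{(t+1)}}{1 - \nu^{t+1}},
$$

where the square-root is taken component-wise, for a global decay rate $\varrho > 0$,
and for a small positive constant $\varepsilon > 0$ to ensure that everything is well-defined.

• nadam is the Nesterov-accelerated [284] version of adam. Similarly as when
going from (7.19)-(7.20) to (7.23), the acceleration is obtained by a shift of 1 in
the velocity parameter, thus, consider the Nesterov-accelerated adam update

$$
\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \frac{\varrho}{\varepsilon + \sqrt{\dfrac{r^{(t+1)}}{1 - \alpha^{t+1}}}} \odot \frac{\nu \mathbf{v}^{(t+1)} + (1 - \nu) \nabla_\vartheta \mathfrak{D}(Y, \vartheta^{(t)})}{1 - \nu^{t+1}},
$$

using (7.25) and (7.26).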
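
The adam and nadam updates can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the keras implementation: the bias correction uses the exponent $t+1$ for the step from $t$ to $t+1$ (so that the first step $t = 0$ is well-defined), the constant $\varepsilon$ is added outside the square-root, and the default values follow the suggestions of Kingma–Ba. The function name `adam_step` and its flag `nesterov` are hypothetical.

```python
import math


def adam_step(theta, v, r, t, grad, rho=0.001, nu=0.9, alpha=0.999,
              eps=1e-8, nesterov=False):
    """One update from algorithmic time t to t+1 (t = 0, 1, 2, ...).
    Returns the new parameter and the updated moments v and r."""
    g = grad(theta)
    v = [nu * vj + (1.0 - nu) * gj for vj, gj in zip(v, g)]            # (7.25)
    r = [alpha * rj + (1.0 - alpha) * gj * gj for rj, gj in zip(r, g)]  # (7.26)
    if nesterov:
        # nadam: shift of 1 in the velocity, i.e. nu * v^(t+1) + (1-nu) * grad
        numerator = [nu * vj + (1.0 - nu) * gj for vj, gj in zip(v, g)]
    else:
        numerator = v
    v_hat = [nj / (1.0 - nu ** (t + 1)) for nj in numerator]   # bias correction
    r_hat = [rj / (1.0 - alpha ** (t + 1)) for rj in r]
    theta = [th - rho * vh / (math.sqrt(rh) + eps)
             for th, vh, rh in zip(theta, v_hat, r_hat)]
    return theta, v, r


if __name__ == "__main__":
    def toy_grad(theta):
        """Gradient of the hypothetical loss sum(theta_j^2)."""
        return [2.0 * th for th in theta]

    theta, v, r = [1.0, -3.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(3):
        theta, v, r = adam_step(theta, v, r, t, toy_grad)
        print(t + 1, theta)
```

In the first adam step every component moves by approximately $\varrho$, independently of the magnitude of its gradient component; this is the component-wise normalization described above.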


Maximum Likelihood Estimation and Over-fitting 


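As explained above, we model the mean of the datum $(Y, x)$ by a deep FN network

$$
x \mapsto \mu(x) = \mu_\vartheta(x) = \mathbb{E}_{\theta(x)}[Y] = g^{-1} \left\langle \beta, z^{(d:1)}(x) \right\rangle,
$$

for a network parameter $\vartheta \in \mathbb{R}^r$. MLE of this network parameter requires solving
for given data $Y$

$$
\widehat{\vartheta}^{\rm MLE} = \underset{\vartheta}{\arg\min} \ \mathfrak{D}(Y, \vartheta).
$$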

In Fig. 7.5 we give a schematic figure of a loss surface $\vartheta \mapsto \mathfrak{D}(Y, \vartheta)$ for a (low-
dimensional) example $\vartheta \in \mathbb{R}^2$. The two plots show the same loss surface from two
different angles. This loss surface has three (local) minimums (red color), and the
smallest one (global minimum) gives the MLE $\widehat{\vartheta}^{\rm MLE}$.


In general, this global minimum cannot be found for more complex network
architectures because the loss surface typically has a complicated structure for high-
dimensional parameter spaces. Is this a problem in FN network fitting? Not really!
We are going to explain why. The universality theorems in Sect. 7.2.2 state that more
complex FN networks have an excellent approximation capacity. If we translate
this to our statistical modeling problem it means that the observations $Y$ can be
approximated arbitrarily well by sufficiently complex FN networks. In particular,
for a given complex network architecture, the MLE $\widehat{\vartheta}^{\rm MLE}$ will provide the optimal
fit of this architecture to the data $Y$, and, as a result, this network does not only
reflect the systematic effects in the data but also the noisy part. This behavior is
called (in-sample) over-fitting to the learning data $\mathcal{L}$. It implies that such statistical
models typically have a poor generalization to unseen (out-of-sample) test data $\mathcal{T}$;

this is illustrated by the red color in Fig. 7.6. For this reason, in general, we are
not interested in finding the MLE $\widehat{\vartheta}^{\rm MLE}$ of $\vartheta$ in FN network regression modeling,
but we would like to find a parameter estimate $\widehat{\vartheta}$ that (only) extracts the systematic
effects from the learning data $\mathcal{L}$. This is illustrated by the different colors in Figs. 7.5
and 7.6, where we assume: (a) red color provides models with a poor generalization
power due to over-fitting, (b) blue color provides models with a poor generalization
power, too, because these parametrizations do not explain the systematic effects in
the data at all (called under-fitting), and (c) green color gives good parametrizations
that explain the systematic effects in the data and generalize well to unseen data.
Thus, the aim is to find parametrizations that are in the green area of Fig. 7.5.
This green area emphasizes that we lose the notion of uniqueness because there
are infinitely many models in the green area that have a comparable generalization
power. Next we explain how we can exploit the gradient descent algorithm to make
it useful for finding parametrizations in the green area.

Fig. 7.5 Schematic figure of a loss surface $\vartheta \mapsto \mathfrak{D}(Y, \vartheta)$ from two different angles for a two-
dimensional parameter $\vartheta \in \mathbb{R}^2$

Fig. 7.6 Schematic figure of in-sample over-fitting, under-fitting (blue) and
extracting systematic effects


Remark 7.8 The loss surface considerations in Fig. 7.5 are based on a fixed network 
architecture. Recent research promotes the so-called Graph HyperNetwork (GHN) 
that is a (hyper-)network which tries to find the optimal network architecture and 
its parametrization by an additional network, we refer to Zhang et al. [402] and 
Knyazev et al. [219]. 


Regularization Through Early Stopping 


As stated above, if we run the gradient descent algorithm with properly tempered
learning rates it will converge to a local minimum of the loss function, which means
that the resulting FN network over-fits to the learning data. For this reason we need
to stop the gradient descent algorithm early. Coming back to Fig. 7.5,
typically, we start the gradient descent algorithm somewhere in the blue area of
the loss surface (provided that the red area is a sparse set on the loss surface).
Visually speaking, the gradient descent algorithm then walks down the valley (green,
yellow and red area) by exploiting locally optimal steps. Since at the early stage of
the algorithm the systematic effects play a dominant role over the noisy part, the
gradient descent algorithm learns these systematic effects at this first stage (blue
area in Fig. 7.5). When the algorithm arrives at the green area the noisy part in the
data starts to increasingly influence the model calibration (gradient descent steps),
and, henceforth, at this stage the algorithm should be stopped, and the learned
parameter should be selected for predictive modeling. This early stopping is an
implicit way of regularization, because it implies that we stop the parameter fitting
before the parameters start to learn very individual features of the (noisy) data (and
take extreme values).

This early stopping point is determined by doing an out-of-sample analysis. This
requires the learning data $\mathcal{L}$ to be further split into training data $\mathcal{U}$ and validation
data $\mathcal{V}$. The training data $\mathcal{U}$ is used for gradient descent parameter learning, and
the validation data $\mathcal{V}$ is used for tracking the over-fitting by an instantaneous
(out-of-sample) validation analysis. This partition is illustrated in Fig. 7.7, which also
highlights that the validation data $\mathcal{V}$ is disjoint from the test data $\mathcal{T}$, the latter only
being used in the final step for comparing different statistical models (e.g., a GLM
vs. a FN network). That is, model comparison is done in a proper out-of-sample
manner on $\mathcal{T}$, and each of these models is only fit on $\mathcal{U}$ and $\mathcal{V}$. Thus, for FN network
fitting with early stopping we need a reasonable amount of data that can be split into
3 sufficiently large data sets so that each is suitable for its purpose.

For early stopping we partition the learning data $\mathcal{L}$ into training data $\mathcal{U}$ and
validation data $\mathcal{V}$. The plain vanilla gradient descent algorithm can then be changed
as follows.

Fig. 7.7 Partition of entire data $\mathcal{D}$ (lhs) into learning data $\mathcal{L}$ and test data $\mathcal{T}$ (middle), and into
training data $\mathcal{U}$, validation data $\mathcal{V}$ and test data $\mathcal{T}$ (rhs)


Plain vanilla gradient descent algorithm with early stopping 


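1. Choose an initial network parameter $\vartheta^{(0)} \in \mathbb{R}^r$.
2. Iterate for $t \ge 0$ until the early stopping criterion is met:

(a) Calculate the gradient $\nabla_\vartheta \mathfrak{D}(\mathcal{U}, \vartheta)$ in network parameter $\vartheta = \vartheta^{(t)}$ on the
training data $\mathcal{U}$ using (7.16) and the back-propagation method of
Proposition 7.5 (for the hyperbolic tangent activation function).

(b) Make the gradient descent step for a suitable learning rate $\varrho_{t+1} > 0$

$$\vartheta^{(t)} \mapsto \vartheta^{(t+1)} = \vartheta^{(t)} - \varrho_{t+1} \nabla_\vartheta \mathfrak{D}(\mathcal{U}, \vartheta^{(t)}).$$

(c) Calculate the validation loss $\mathfrak{D}(\mathcal{V}, \vartheta^{(t+1)})$ on the validation data $\mathcal{V}$.
(d) Stop the algorithm if the validation loss increases, i.e., if

$$\mathfrak{D}(\mathcal{V}, \vartheta^{(t+1)}) > \mathfrak{D}(\mathcal{V}, \vartheta^{(t)}), \tag{7.27}$$

and return the learned parameter (estimate) $\widehat{\vartheta} = \vartheta^{(t)}$.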


In applications we use the SGD algorithm that can also have erratic steps because 
not all random (mini-)batches are necessarily typical representations of the data. 
In such cases we should use more sophisticated stopping criteria than (7.27), for 
instance, early stop if the validation loss increases five times in a row. 
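
The stopping rule (7.27) and its more tolerant variant can be sketched in a few lines. The following plain-Python illustration is not the keras implementation used in this chapter: the function name, the toy losses and the `patience` argument (number of consecutive increases tolerated) are assumptions made for this sketch. It keeps the parameter with the lowest validation loss seen so far, which for `patience=1` coincides with returning the parameter before the first increase.

```python
def fit_with_early_stopping(theta0, grad_train, loss_val,
                            rate=0.1, patience=5, max_steps=10_000):
    """Gradient descent on the training loss, stopped via the validation loss.

    theta0     : initial parameter (list of floats)
    grad_train : function theta -> gradient of the training loss (list)
    loss_val   : function theta -> validation loss (float)
    patience   : stop after this many consecutive validation-loss increases
    """
    theta = list(theta0)
    best_theta, best_loss = list(theta), loss_val(theta)
    previous, increases = best_loss, 0
    for _ in range(max_steps):
        gradient = grad_train(theta)
        theta = [th - rate * g for th, g in zip(theta, gradient)]
        current = loss_val(theta)
        if current < best_loss:
            # remember the best parameter on the validation data
            best_theta, best_loss = list(theta), current
        increases = increases + 1 if current > previous else 0
        previous = current
        if increases >= patience:
            break
    return best_theta, best_loss


# Toy problem (hypothetical): the training loss (theta - 1)^2 is minimal at 1,
# whereas the validation loss (theta - 0.8)^2 is minimal at 0.8.
grad_train = lambda th: [2.0 * (th[0] - 1.0)]
loss_val = lambda th: (th[0] - 0.8) ** 2

theta_hat, val_loss = fit_with_early_stopping([0.0], grad_train, loss_val, patience=5)
print(theta_hat, val_loss)
```

In this toy example the returned parameter stays close to the validation optimum 0.8 instead of running on to the training optimum 1, which is the behavior early stopping is meant to produce.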
Fig. 7.8 Training loss $\mathfrak{D}(\mathcal{U}, \vartheta^{(t)})$ vs. validation loss $\mathfrak{D}(\mathcal{V}, \vartheta^{(t)})$ over different
iterations $t \ge 0$ of the SGD algorithm

[Figure: stochastic gradient descent algorithm; curves: training loss, validation loss,
minimal validation loss; x-axis: training epochs (0 to 1000); y-axis: (modified) deviance loss]


Figure 7.8 provides an example of the application of the SGD algorithm on
training data $\mathcal{U}$ and validation data $\mathcal{V}$. The training loss is in blue color and the
validation loss in green color. We observe that the validation loss has its minimum
after 52 epochs (orange vertical line), and hence the fitting algorithm should be
stopped at this point. We give a couple of remarks concerning Fig. 7.8:


• The learning data $\mathcal{L}$ exactly corresponds to the claims frequency data of
Sect. 5.2.4, see also Table 5.2. We take 10% as validation data, which gives
$|\mathcal{U}| = 549'185$ and $|\mathcal{V}| = 61'021$. For the SGD algorithm we use batches of size
10'000, which implies that one epoch corresponds to $\lfloor 549'185/10'000 \rfloor = 54$
gradient descent steps. For batches of size 10'000 we expect an approximate
estimation precision on an average frequency of $\bar{\lambda} = 7.36\%$ in the Poisson model
of

$$\bar{\lambda} \pm 2 \sqrt{\frac{\bar{\lambda}}{10'000\,\bar{v}}} = [6.62\%, 8.11\%],$$

with an average exposure $\bar{v} = 0.5283$ on our learning data; we also refer to
Example 3.22.

• The FN network architecture used in Fig. 7.8 is the one shown in Fig. 7.2
using one-hot encoding for categorical variables, see Sect. 7.3.1, below, and the
responses are modeled by a Poisson distribution.

• The training loss $\mathfrak{D}(\mathcal{U}, \vartheta^{(t)})$, blue curve in Fig. 7.8, is a bit wiggly, which comes
from the fact that we use SGD, where not every batch leads to the optimal
decrease in loss. Note that the loss figures in the graph correspond to average
losses over an entire epoch, i.e., in our case an average over 54 SGD steps. Also
note that the y-scale does not show the Poisson deviance loss: we use the loss
figures provided by keras [77] and these figures drop all terms of the deviance
loss that are not relevant for parameter estimation.
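
The batch arithmetic of the first remark can be reproduced with a short calculation. This is a plain-Python sketch for illustration only; the function name is hypothetical, and the inputs are the rounded figures quoted in the text (7.36% and 0.5283), so the bounds agree with the printed interval only up to rounding. Whether the final, incomplete batch counts as a gradient descent step depends on the implementation, so both counts are shown.

```python
import math


def poisson_frequency_band(lam, batch_size, avg_exposure, width=2.0):
    """Return lam -/+ width * sqrt(lam / (batch_size * avg_exposure))."""
    std_err = math.sqrt(lam / (batch_size * avg_exposure))
    return lam - width * std_err, lam + width * std_err


n_train, batch = 549_185, 10_000
full_batches = n_train // batch              # complete batches per epoch
all_batches = math.ceil(n_train / batch)     # including the final partial batch

lower, upper = poisson_frequency_band(0.0736, batch, 0.5283)
print(full_batches, all_batches)
print(f"[{lower:.2%}, {upper:.2%}]")
```

The band of roughly 6.6% to 8.1% shows how much the frequency estimated from a single batch may fluctuate around the portfolio average, which is one source of the erratic SGD steps.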
We close this section with remarks.

Remarks 7.9


• We perform early stopping because otherwise a complex FN network would
in-sample over-fit to the learning data. At this stage, one could be tempted to
choose a smaller network to prevent over-fitting. In general, this is not a
sensible thing to do because the network needs sufficient flexibility to be able to
be fitted to the data. That is, we need some redundancy in the model to be able to
successfully apply the SGD algorithm, otherwise the algorithm may get trapped
in saddlepoints or bottlenecks. Thus, the chosen network architecture should be
above the bound of a necessary minimal complexity, and different architectures
above this bound will provide similar accuracy (without a clear winner).

• The chosen network will contain certain elements of randomness, and different
runs of the SGD algorithm will provide different solutions. Firstly, the
initialization $\vartheta^{(0)} \in \mathbb{R}^r$ of the algorithm is chosen at random, and since we early stop
the algorithm and because we do not have a unique optimal point, the chosen
solution will depend on this random initialization. Secondly, the split between
training and validation data is done at random, and thirdly the partitioning of the
training data into mini-batches is done at random. All these random elements
make the early stopped SGD solution non-unique.

• Early stopping implies that the chosen network parameter estimate $\widehat{\vartheta}$ does not
correspond to a solution of the score equations and, hence, asymptotic
results about MLEs do not apply, see Theorem 3.28.
7.3.1 Feature Pre-processing


Similarly to GLMs, we also need to pre-process the feature components in FN 
network regression modeling. The former Sect. 5.2.2 for GLMs has been called 
‘feature engineering’ because we need to bring the feature components into an 
appropriate functional form w.r.t. the given regression task. The present section is 
called ‘feature pre-processing’ because we do not need to engineer the features for 
FN networks. We only need to bring them into a suitable (tabular) form to enter the 
network, and the network will then do an automated feature engineering through 
representation learning. 


Categorical Feature Components: One-Hot Encoding 


The categorical features have been treated by dummy coding within GLMs. Dummy
coding provides full rank design matrices. For FN network regression modeling the
full rank property is not important because, anyway, we neither have a single (local)
minimum in the objective function, nor do we want to calculate the MLE of the
network parameter. Typically, in FN network regression modeling one uses
one-hot encoding for the categorical variables that encodes every level by a unit vector.
Assume the raw feature component $\tilde{x}_j$ is a categorical variable taking $K$ different
levels $\{a_1, \ldots, a_K\}$. One-hot encoding is obtained by the embedding map

$$\tilde{x}_j \mapsto x_j = \left(\mathbb{1}_{\{\tilde{x}_j = a_1\}}, \ldots, \mathbb{1}_{\{\tilde{x}_j = a_K\}}\right)^\top \in \mathbb{R}^K. \tag{7.28}$$

An explicit example is given in Table 7.2 which should be compared to Table 5.1.

Table 7.2 One-hot encoding of the levels as unit vectors of the 11-dimensional
Euclidean space $\mathbb{R}^{11}$, showing $x_j^\top$ as row vectors

a_1 = white      1 0 0 0 0 0 0 0 0 0 0
a_2 = yellow     0 1 0 0 0 0 0 0 0 0 0
a_3 = orange     0 0 1 0 0 0 0 0 0 0 0
a_4 = red        0 0 0 1 0 0 0 0 0 0 0
...
a_6 = violet     0 0 0 0 0 1 0 0 0 0 0
...
a_10 = beige     0 0 0 0 0 0 0 0 0 1 0
a_11 = brown     0 0 0 0 0 0 0 0 0 0 1
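
The embedding map (7.28) amounts to an indicator vector over the ordered list of levels. The plain-Python sketch below is for illustration; the function name and the four-level subset of colors are assumptions chosen for brevity, not the full level set of Table 7.2.

```python
import math


def one_hot(level, levels):
    """Map a raw categorical level to its unit vector in R^K, K = len(levels)."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == a else 0 for a in levels]


def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


levels = ["white", "yellow", "orange", "red"]     # illustrative subset, K = 4
codes = {a: one_hot(a, levels) for a in levels}
print(codes["orange"])

# any two different levels are at the same distance sqrt(2) from each other
distances = {euclidean_distance(codes[a], codes[b])
             for a in levels for b in levels if a != b}
print(distances)
```

The second output makes explicit that one-hot encoding carries no notion of similarity between levels: all pairs are equally far apart.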


Continuous Feature Components 


The continuous feature components do not need any feature engineering; they can
directly enter the FN network, which will take care of representation learning.
However, an efficient use of gradient descent methods typically requires that all
feature components live on a similar scale and that they are roughly uniformly
spread across their domains. This makes gradient descent steps more efficient in
exploiting the relevant directions.

One possibility is to use the MinMaxScaler. Let $x_j^-$ and $x_j^+$ be the minimal and
maximal possible feature values of the continuous feature component $x_j$, i.e.,
$x_j \in [x_j^-, x_j^+]$. We transform this continuous feature component to unit scale for
all data $1 \le i \le n$ by

$$x_{i,j} \mapsto x_{i,j}^{\rm MM} = 2\, \frac{x_{i,j} - x_j^-}{x_j^+ - x_j^-} - 1 \in [-1, 1]. \tag{7.29}$$

The resulting feature values $(x_{i,j}^{\rm MM})_{1 \le i \le n}$ should roughly be uniformly spread
across the interval $[-1, 1]$. If this is not the case, for instance, because we have
outliers in the feature values, we may first transform them non-linearly to get
more uniformly spread values. For example, we consider the Density of the car
frequency example on the log scale.

An alternative to the MinMaxScaler is to consider normalization with the
empirical mean $\bar{x}_j$ and the empirical standard deviation $\widehat{\sigma}_j$ over all data $x_{i,j}$. That
is,

$$x_{i,j} \mapsto x_{i,j}^{\rm SD} = \frac{x_{i,j} - \bar{x}_j}{\widehat{\sigma}_j}. \tag{7.30}$$

It depends on the application whether the MinMaxScaler or normalization with
the empirical mean and standard deviation works better. Important in applications
is that we use exactly the same values for the normalization of training data $\mathcal{U}$,
validation data $\mathcal{V}$ and test data $\mathcal{T}$, to make the same network applicable to all
these data sets. For notational convenience we will drop the upper index in $x_{i,j}^{\rm MM}$
or $x_{i,j}^{\rm SD}$, respectively, and we throughout assume that all feature components are
appropriately pre-processed.
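
Both scalings, and the requirement to reuse the same constants on all data sets, can be sketched as follows. This plain-Python illustration uses hypothetical function names and made-up density values; the bounds are taken from the training sample here (an assumption, the text allows for the minimal and maximal possible values), and the standard deviation uses the divisor n (also an assumption).

```python
import math


def fit_min_max(values):
    """Bounds used in (7.29); here taken from the given sample."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("constant feature cannot be scaled")
    return lo, hi


def min_max_scale(x, lo, hi):
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def fit_standardizer(values):
    """Empirical mean and standard deviation used in (7.30), divisor n."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, sd


def standardize(x, mean, sd):
    return (x - mean) / sd


# hypothetical population densities, transformed to the log scale first
train_log = [math.log(d) for d in (10.0, 100.0, 1000.0, 10000.0)]
lo, hi = fit_min_max(train_log)                 # fitted once on the training data
train_scaled = [min_max_scale(v, lo, hi) for v in train_log]

# the same constants lo, hi are reused for a new (test) observation
test_scaled = min_max_scale(math.log(100000.0), lo, hi)

mean, sd = fit_standardizer(train_log)
train_std = [standardize(v, mean, sd) for v in train_log]
print(train_scaled, test_scaled, train_std)
```

Note that the test value falls outside [-1, 1] because its density exceeds the training maximum; this is the price of sample-based bounds and a reason to prefer bounds that cover the whole feature domain.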


7.3.2 Lab: Poisson FN Network for Car Insurance Frequencies 


We present a first FN network example applied to the French MTPL claim frequency
data studied in Sect. 5.2.4. We assume that the claim counts $N_i$ are independent and
Poisson distributed with claim count density (5.26), where we replace the GLM
regression function $x \mapsto \exp\langle \beta, x \rangle$ by a FN network regression function

$$x \in \mathcal{X} \mapsto \mu(x) = \exp\left\langle \beta, z^{(d:1)}(x) \right\rangle.$$

We use a FN network of depth $d = 3$ having number of neurons $(q_1, q_2, q_3) =
(20, 15, 10)$ and using the hyperbolic tangent activation function. We pre-process
the categorical variables VehBrand and Region by one-hot encoding, providing
input dimensions 11 and 22, respectively. The binary variable VehGas
is encoded as 0-1. Because of scarcity of data we right-censor the continuous
variables VehAge at 20, DrivAge at 90 and BonusMalus at 150, and we
transform Density to the log scale. We then apply to each of these (modified)
continuous variables Area, VehPower, VehAge, DrivAge, BonusMalus and
log(Density) a MinMaxScaler. This provides us with an input dimension $q_0 =
11 + 22 + 1 + 6 = 40$. The resulting FN network is illustrated in Fig. 7.2, with
the one-hot encoded variables VehBrand in orange color and Region in magenta
color. It has a network parameter $\vartheta \in \mathbb{R}^r$ of dimension $r = 1'306$.
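
The dimension of the network parameter follows from counting, for every fully connected layer, one weight per input plus one intercept for each neuron. The short plain-Python check below is for illustration; the function name is hypothetical.

```python
def fn_parameter_count(q0, hidden, output_dim=1):
    """Parameters per fully connected layer: (inputs + 1 intercept) * neurons."""
    dims = [q0] + list(hidden) + [output_dim]
    return [(dims[m] + 1) * dims[m + 1] for m in range(len(dims) - 1)]


per_layer = fn_parameter_count(40, (20, 15, 10))
print(per_layer, sum(per_layer))
```

The per-layer counts 820, 315, 160 and 11 add up to the stated dimension of 1'306.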

This network is implemented in R using the library keras [77]. The code is 
provided in Listing 7.1 and the resulting network architecture is summarized in 
Listing 7.2. This network is now fitted to the data. We use a batch size of 10’000, 
we use the nadam version of SGD, we use 10% of the learning data $\mathcal{L}$ as validation
data $\mathcal{V}$ and the remaining 90% as training data $\mathcal{U}$. We then run the corresponding
Listing 7.1 FN network of depth d = 3 using the R library keras [77]


library(keras)
#
# inputs: 40-dimensional pre-processed features and the exposure (volume)
Design = layer_input(shape = c(40), dtype = 'float32', name = 'Design')
Vol    = layer_input(shape = c(1),  dtype = 'float32', name = 'Vol')
#
# three tanh FN layers and an output layer with exponential activation (log-link);
# lambda0 is assumed to be defined beforehand, it initializes the output intercept
Network = Design %>%
    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
    layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
    layer_dense(units=1, activation='exponential', name='Network',
        weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
#
# multiply the network output by the exposure
Response = list(Network, Vol) %>% layer_multiply(name='Multiply')
#
model = keras_model(inputs = c(Design, Vol), outputs = c(Response))
#
summary(model)


Listing 7.2 FN network illustrated in Fig. 7.2 


Layer (type)           Output Shape    Param #    Connected to
Design (InputLayer)    (None, 40)      0
FNLayer1 (Dense)       (None, 20)      820        Design[0][0]
FNLayer2 (Dense)       (None, 15)      315        FNLayer1[0][0]
FNLayer3 (Dense)       (None, 10)      160        FNLayer2[0][0]
Network (Dense)        (None, 1)       11         FNLayer3[0][0]
Vol (InputLayer)       (None, 1)       0
Multiply (Multiply)    (None, 1)       0          Network[0][0]
                                                  Vol[0][0]


Total params: 1,306 
Trainable params: 1,306 
Non-trainable params: 0 


Listing 7.3 Fitting a FN network using the R library keras [77] 


# file path where the callback stores the network weights
path0 <- "path_for_callback"
# keep only the weights of the epoch with the lowest validation loss
CBs <- callback_model_checkpoint(path0, monitor = "val_loss", verbose = 0,
                                 save_best_only = TRUE, save_weights_only = TRUE)
#
model %>% compile(loss = 'poisson', optimizer = 'nadam')
fit <- model %>% fit(list(Xlearn, Vlearn), Ylearn, validation_split=0.1,
                     batch_size=10000, epochs=1000, verbose=0, callbacks=CBs)
#
# retrieve the network calibration with the lowest validation loss
load_model_weights_hdf5(model, path0)




Table 7.3 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5 and the FN network model (with one-hot encoding of the categorical variables)

                                          Run     #        In-sample             Out-of-sample         Aver.
                                          time    param.   loss on $\mathcal{L}$   loss on $\mathcal{T}$   freq.
Poisson null                              -       1        25.213                25.445                7.36%
Poisson GLM3                              15s     50       24.084                24.102                7.36%
One-hot FN (q1, q2, q3) = (20, 15, 10)    51s     1'306    23.757                23.885                6.96%


SGD algorithm and we retrieve the network with the lowest validation loss using
a callback. This is illustrated in Listing 7.3. The fitting performance on the
training and validation data is illustrated in Fig. 7.8, and we retrieve the network
calibration after the 52nd epoch because it has the lowest validation loss. The results
are presented in Table 7.3.

From the results of Table 7.3 we conclude that the FN network outperforms
model Poisson GLM3 (out-of-sample) since it has a (clearly) lower out-of-sample
deviance loss on the test data $\mathcal{T}$. This may indicate that there is an interaction
between the feature components that has not been captured in the GLM. The run
time of 51s corresponds to the run time until the minimal validation loss is reached.
Of course, in practice we need to continue beyond this minimal validation loss to
ensure that we have really found the minimum. Finally, and importantly, we observe
that this early stopped FN network calibration does not meet the balance property
because the resulting average frequency of this fitted model of 6.96% is below the
empirical frequency of 7.36%. This is a major deficiency of this FN network fitting
approach, and this is going to be discussed further in Sect. 7.4.2, below.

We can perform a detailed analysis about different batch sizes, variants of SGD 
methods, run times, etc. We briefly summarize our findings; this summary is also 
based on the findings in Ferrario et al. [127]. We have fitted this model on batches 
of sizes 2’000, 5’000, 10’000 and 20’000, and it seems that a batch size around 
5000 has the best performance, both concerning out-of-sample performance and 
run time to reach the minimal validation loss. Comparing the different optimizers 
rmsprop, adam and nadam, a clear preference can be given to nadam: the 
resulting prediction accuracy is similar in all three optimizers (they all reach the 
green area in Fig.7.5), but nadam reaches this optimal point in half of the time 
compared to rmsprop and adam. 

We conclude by highlighting that different initial points $\vartheta^{(0)}$ of the SGD
algorithm will give different network calibrations, and differences can be
considerable. This is discussed in Sect. 7.4.4, below. Moreover, we could explore different
network architectures, simpler ones, more complex ones, different activation
functions, etc. The results of these different architectures will not be essentially
different from our results, as long as the networks are above a minimal complexity
bound. This closes our first example on FN networks and this example is the
benchmark for refined versions that are presented in the subsequent sections.
7.4 Special Features in Networks
7.4.1 Special Purpose Layers 


So far, our networks consist of stacked FN layers, and information is passed in a 
directed acyclic feed-forward path from one to the next FN layer. In this section we 
discuss special purpose layers that perform a specific task in a FN network. These 
include embedding layers, drop-out layers and normalization layers. These modules 
should be seen as add-ons to the FN layers. Besides these add-ons, there are also 
recurrent layers and convolutional layers. These two types of layers are going to be
discussed in their own chapters, below, because their importance goes beyond just
being add-ons to the FN layers.


Embedding Layers for Categorical Feature Components 


The categorical feature components have been treated either by dummy coding or
by one-hot encoding, and this has resulted in numerous network parameters in the
first FN layer, see Fig. 7.2. Natural language processing (NLP) treats categorical
feature components differently, namely, it embeds categorical feature components
(or words in NLP) into a Euclidean space $\mathbb{R}^b$ of a small dimension $b$. This small
dimension $b$ is a hyper-parameter that has to be selected by the modeler, and which,
typically, is selected much smaller than the total number of levels of the categorical
feature. This embedding technique is quite common in NLP, see Bengio et al.
[27–29], but it goes beyond NLP applications, see Guo–Berkhahn [176], and it has been
introduced to the actuarial community by Richman [312, 313] and the tutorial of
Schelldorfer–Wüthrich [329].

We assume the same set-up as in dummy coding (5.21) and in one-hot
encoding (7.28), namely, that we have a raw categorical feature component $\tilde{x}_j$ taking $K$
different levels $\{a_1, \ldots, a_K\}$. In one-hot encoding these $K$ levels are mapped to the
$K$ unit vectors of the Euclidean space $\mathbb{R}^K$, and consequently all levels have the same
mutual Euclidean distance. This does not seem to be the best way of comparing the
different levels because in our regression analysis we would like to identify the
levels that are more similar w.r.t. the regression task and, thus, these should cluster.
For an embedding layer one chooses a Euclidean space $\mathbb{R}^b$ of a dimension $b < K$,
typically being (much) smaller than $K$. One then considers the embedding map

$$e : \{a_1, \ldots, a_K\} \to \mathbb{R}^b, \qquad a_k \mapsto e(a_k) \overset{\text{def.}}{=} e^{(k)}. \tag{7.31}$$

That is, every level $a_k$ receives a vector representation $e^{(k)} \in \mathbb{R}^b$ which is
lower dimensional than its one-hot encoding counterpart in $\mathbb{R}^K$. Proximity of the
representations $e^{(k)}$ and $e^{(k')}$ in $\mathbb{R}^b$, i.e., of two levels $a_k$ and $a_{k'}$, should be related
to similarity w.r.t. the regression task at hand. Such an embedding involves $K$
vectors $e^{(k)} \in \mathbb{R}^b$ of dimension $b$, thus, it involves $Kb$ parameters, called embedding
weights.

Fig. 7.9 (lhs) One-hot encoding with $q_0 = 40$, and (rhs) embedding layers for VehBrand and
Region with embedding dimension $b = 2$ and $q_0 = 11$; the remaining network architecture is
identical with $(q_1, q_2, q_3) = (20, 15, 10)$ for depth $d = 3$

In network modeling, these embedding weights $e^{(1)}, \ldots, e^{(K)}$ can also be learned
during gradient descent training. Basically, it just means that for the categorical
variables we add an additional embedding layer before the first FN layer $z^{(1)}$, i.e.,
we increase the depth of the network by 1 for the categorical feature components
(by a layer that is not fully connected). This is illustrated in Fig. 7.9 (rhs) for
the French MTPL insurance example of Sect. 7.3.2. The graph on the left-hand
side shows the network if we apply one-hot encoding to the categorical variables
VehBrand and Region; this results in a network parameter of dimension $r =
1'306$. The graph on the right-hand side first embeds VehBrand and Region
into two 2-dimensional spaces, illustrated by the orange and magenta circles. These
embeddings are concatenated with the remaining feature components, which then
provides a new dimension $q_0 = 7 + 2 + 2 = 11$ in that example. This results in a
network parameter of dimension $r = 726 + 22 + 44 = 792$, where $22 + 44 = 66$
stands for the 2-dimensional embedding weights of the 11 VehBrands and the 22
French Regions, see Listing 7.5.
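
An embedding layer is a table look-up: level number k selects row k of a K x b weight matrix, and the selected rows are concatenated with the remaining feature components. The plain-Python sketch below illustrates this and re-derives the parameter count; the function name and all weight values are made up for illustration (in the network they are learned), and levels are indexed from 0.

```python
def embedded_input(continuous, level_indices, weight_matrices):
    """Concatenate the continuous/binary components with the embeddings e^(k)."""
    x = list(continuous)
    for k, weights in zip(level_indices, weight_matrices):
        x.extend(weights[k])          # row k of the K x b matrix is e^(k)
    return x


K_brand, K_region, b = 11, 22, 2
brand_w = [[0.01 * k, -0.01 * k] for k in range(K_brand)]    # hypothetical weights
region_w = [[0.02 * k, 0.0] for k in range(K_region)]        # hypothetical weights

x = embedded_input([0.0] * 7, level_indices=[3, 20],
                   weight_matrices=[brand_w, region_w])
q0 = len(x)

dims = [q0, 20, 15, 10, 1]
dense_params = sum((dims[m] + 1) * dims[m + 1] for m in range(len(dims) - 1))
embedding_params = K_brand * b + K_region * b
print(q0, dense_params, embedding_params, dense_params + embedding_params)
```

The reduction from 1'306 to 792 parameters comes entirely from the first FN layer, which now sees 11 instead of 40 input components.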


Example 7.10 (Embedding Layers for Categorical Features) We revisit the exam- 
ple of Sect. 7.3.2, but we replace one-hot encoding of the categorical variables by 
embedding layers of dimension b = 2. The corresponding R code is given in 
Listing 7.4 and the resulting model is illustrated in Listing 7.5 and Fig. 7.9 (rhs). 

Apart from replacing one-hot encoding by embedding layers, we use exactly 
the same FN network architecture as in Sect.7.3.2 and we apply the same fitting 
strategy in terms of batch sizes, optimizer and early stopping strategy. The results 
are presented in Table 7.4. 




Listing 7.4 FN network of depth d = 3 using embedding layers 


# inputs: 7 continuous/binary components, two categorical labels and the exposure
Design   = layer_input(shape = c(7), dtype = 'float32', name = 'Design')
VehBrand = layer_input(shape = c(1), dtype = 'int32', name = 'VehBrand')
Region   = layer_input(shape = c(1), dtype = 'int32', name = 'Region')
Vol      = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
#
# 2-dimensional embeddings of the 11 vehicle brands and the 22 regions
BrandEmb = VehBrand %>%
    layer_embedding(input_dim=11, output_dim=2, input_length=1, name='BrandEmb') %>%
    layer_flatten(name='Brand_flat')
RegionEmb = Region %>%
    layer_embedding(input_dim=22, output_dim=2, input_length=1, name='RegionEmb') %>%
    layer_flatten(name='Region_flat')
#
# concatenate all components and feed them to the FN layers
Network = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate(name='concate') %>%
    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
    layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
    layer_dense(units=1, activation='exponential', name='Network',
        weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))
#
Response = list(Network, Vol) %>% layer_multiply(name='Multiply')
#
model = keras_model(inputs = c(Design, VehBrand, Region, Vol),
                    outputs = c(Response))


Table 7.4 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension
b = 2, respectively)

                                          Run     #        In-sample             Out-of-sample         Aver.
                                          time    param.   loss on $\mathcal{L}$   loss on $\mathcal{T}$   freq.
Poisson null                              -       1        25.213                25.445                7.36%
Poisson GLM3                              15s     50       24.084                24.102                7.36%
One-hot FN (q1, q2, q3) = (20, 15, 10)    51s     1'306    23.757                23.885                6.96%
Embed FN (q1, q2, q3) = (20, 15, 10)      120s    792      23.694                23.820                7.24%


A first remark is that the model calibration takes longer using embedding layers 
compared to one-hot encoding. The main reason for this is that having an embedding 
layer increases the depth of the network by one layer, as can be seen from Fig. 7.9. 
Therefore, the back-propagation takes more time, and the convergence is slower 
requiring more gradient descent steps. We have less over-fitting as can be seen from 
Fig. 7.10. The final fitted model has a slightly better out-of-sample performance 
compared to the one-hot encoding one. However, this slight improvement in the 
performance should not be overstated because, as explained in Remarks 7.9, there 
are a couple of elements of randomness involved in SGD fitting, and choosing 
a different seed may change the results. We remark that the balance property is 
not fulfilled because the average frequency of the fitted model does not meet the 
empirical frequency, see the last column of Table 7.4; we come back to this in 
Sect. 7.4.2, below. 
Listing 7.5 Summary of FN network of Fig. 7.9 (rhs) using embedding layers of dimension b = 2


Layer (type)             Output Shape     Param #    Connected to
VehBrand (InputLayer)    (None, 1)        0
Region (InputLayer)      (None, 1)        0
BrandEmb (Embedding)     (None, 1, 2)     22         VehBrand[0][0]
RegionEmb (Embedding)    (None, 1, 2)     44         Region[0][0]
Design (InputLayer)      (None, 7)        0
Brand_flat (Flatten)     (None, 2)        0          BrandEmb[0][0]
Region_flat (Flatten)    (None, 2)        0          RegionEmb[0][0]
concate (Concatenate)    (None, 11)       0          Design[0][0]
                                                     Brand_flat[0][0]
                                                     Region_flat[0][0]
FNLayer1 (Dense)         (None, 20)       240        concate[0][0]
FNLayer2 (Dense)         (None, 15)       315        FNLayer1[0][0]
FNLayer3 (Dense)         (None, 10)       160        FNLayer2[0][0]
Network (Dense)          (None, 1)        11         FNLayer3[0][0]
Vol (InputLayer)         (None, 1)        0
Multiply (Multiply)      (None, 1)        0          Network[0][0]
                                                     Vol[0][0]


Trainable params: 792
Non-trainable params: 0


Fig. 7.10 Training loss $\mathfrak{D}(\mathcal{U}, \vartheta^{(t)})$ vs. validation loss $\mathfrak{D}(\mathcal{V}, \vartheta^{(t)})$ over different
iterations $t \ge 0$ of the SGD algorithm in the deep FN network with embedding
layers for categorical variables

Fig. 7.11 Embedding weights $e^{\rm VehBrand} \in \mathbb{R}^2$ and $e^{\rm Region} \in \mathbb{R}^2$ of the categorical variables
VehBrand and Region for embedding dimension b = 2


A major advantage of using embedding layers for the categorical variables is that
we receive a continuous representation of nominal variables, where proximity can be
interpreted as similarity for the regression task at hand. This is nicely illustrated in
Fig. 7.11 which shows the resulting 2-dimensional embeddings $e^{\rm VehBrand} \in \mathbb{R}^2$ and
$e^{\rm Region} \in \mathbb{R}^2$ of the categorical variables VehBrand and Region. The Region
embedding $e^{\rm Region} \in \mathbb{R}^2$ shows surprising similarities with the French map, for
instance, Paris region R11 is adjacent to R23, R22, R21, R26, R24 (which is also
the case in the French map), the Isle of Corsica R94 and the South of France R93,
R91 and R73 are well separated from other regions, etc. Similar observations can
be made for the embedding of VehBrand: Japanese cars B12 are far apart from the
other cars, cars B1, B2, B3 and B6 (Renault, Nissan, Citroen, Volkswagen, Audi,
Skoda, Seat and Fiat) cluster, etc.


Drop-Out Layers and Regularization 


Above, over-fitting to the learning data has been taken care of by early stopping. In
view of Sect. 6.2 one could also use regularization. This can easily be obtained by
replacing (7.14), for instance, by the following $L^p$-regularized counterpart

$$\vartheta \mapsto \frac{2}{n} \sum_{i=1}^n \frac{v_i}{\varphi} \Big( Y_i h(Y_i) - \kappa(h(Y_i)) - Y_i h(\mu_\vartheta(x_i)) + \kappa\big(h(\mu_\vartheta(x_i))\big) \Big) + \lambda \|\vartheta_-\|_p^p,$$

for some $p \ge 1$, regularization parameter $\lambda > 0$ and where the reduced network
parameter $\vartheta_- \in \mathbb{R}^{r-1}$ excludes the intercept parameter $\beta_0$ of the output layer;
we also refer to (6.4) in the context of GLMs. For grouped penalty terms we
refer to (6.21). The difficulty with this approach is the tuning of the regularization
parameter(s) $\lambda$: run time is one issue, suitable grouping is another issue, and
non-uniqueness of the optimal network a further one that can substantially distort the
selection of reasonable regularization parameters.

A more popular method to prevent individual neurons in a FN layer from
over-fitting to a certain task are so-called drop-out layers. A drop-out layer is an additional
layer between FN layers that removes neurons from the network at random during
gradient descent training, i.e., in each gradient descent step, each of the earmarked
neurons is switched off independently of the others with a fixed probability $\delta \in (0, 1)$.
This random removal will imply that the composite of the remaining neurons needs
to be sufficiently well balanced to take over the role of the dropped-out neurons.
Therefore, a single neuron cannot be over-trained to a certain task because it needs
to be able to play several different roles. Drop-out has been introduced by Srivastava
et al. [345] and Wager et al. [373].


Listing 7.6 FN network of depth d = 3 using a drop-out layer, ridge regularization and a 
normalization layer 


Network = list(Design,BrandEmb,RegionEmb) %>% layer_concatenate(name='concate') %>%
    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
    layer_dropout(rate = 0.01) %>%
    layer_dense(units=15, kernel_regularizer=regularizer_l2(0.0001),
        activation='tanh', name='FNLayer2') %>%
    layer_batch_normalization() %>%
    layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
    layer_dense(units=1, activation='exponential', name='Network',
        weights=list(array(0, dim=c(10,1)), array(log(lambda0), dim=c(1))))


Listing 7.6 gives an example, where we add a drop-out layer with a drop-out
probability of $\delta = 0.01$ after the first FN layer, and in the second FN layer we apply
ridge regularization to the weights $(w_{1,j}^{(2)}, \ldots, w_{q_1,j}^{(2)})$, i.e., excluding the intercepts
$w_{0,j}^{(2)}$, $1 \le j \le q_2$. Both the drop-out layer and regularization are only used during
the gradient descent fitting, and these network features are disabled during the
prediction.
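
The different behavior during fitting and prediction can be sketched as follows. This plain-Python illustration is not the keras layer itself: the function name is hypothetical, and the rescaling of the surviving neurons by 1/(1 - delta) is an assumption of this sketch, chosen so that each neuron keeps its expected value (the same scaling is used in the linear example that follows).

```python
import random


def drop_out(activations, delta, training, rng):
    """Drop each neuron independently with probability delta during training.

    Surviving neurons are rescaled by 1 / (1 - delta); in prediction mode the
    activations are passed through unchanged.
    """
    if not training:
        return list(activations)
    keep = 1.0 - delta
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(2023)                 # fixed seed for reproducibility
z = [0.5, -1.2, 0.3, 0.9]                 # hypothetical neuron activations
z_train = drop_out(z, delta=0.25, training=True, rng=rng)
z_pred = drop_out(z, delta=0.25, training=False, rng=rng)
print(z_train, z_pred)
```

A new random mask is drawn in every gradient descent step, so the set of active neurons changes from step to step.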

Drop-out is closely related to ridge regularization as the following linear
Gaussian regression example shows; this consideration is taken from Section 18.6
of Efron-Hastie [117]. Assume we have a linear regression problem with square
loss function

$$D(Y, \beta) = \frac{1}{2} \sum_{i=1}^n \left( Y_i - \langle \beta, x_i \rangle \right)^2.$$

We assume in this Gaussian case that the observations and the features are
standardized, see Sect. 6.2.4. This means that $\sum_{i=1}^n Y_i = 0$, $\sum_{i=1}^n x_{i,j} = 0$ and
$n^{-1} \sum_{i=1}^n x_{i,j}^2 = 1$, for all $1 \le j \le q$. This standardization implies that we can
omit the intercept parameter $\beta_0$ because its MLE is equal to 0.

We introduce i.i.d. drop-out random variables $I_{i,j}$ for $1 \le i \le n$ and $1 \le j \le q$
with $(1-\delta) I_{i,j}$ being Bernoulli distributed with probability $1 - \delta \in (0,1)$. This
scaling implies $E[I_{i,j}] = 1$. Using these Bernoulli random variables we modify the
above square loss function to

$$D_I(Y, \beta) = \frac{1}{2} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^q \beta_j I_{i,j} x_{i,j} \Big)^2,$$

i.e., every individual component $x_{i,j}$ can drop out independently of the others.
Gaussian MLE requires to set the gradient of $D_I(Y, \beta)$ w.r.t. $\beta \in \mathbb{R}^q$ equal to
zero. The average score equation is given by (we average over the drop-out random
variables $I_{i,j}$)

$$E\left[ \nabla_\beta D_I(Y, \beta) \,\middle|\, Y \right] = -X^\top Y + X^\top X \beta + \frac{\delta}{1-\delta}\, \mathrm{diag}\Big( \sum_{i=1}^n x_{i,1}^2, \ldots, \sum_{i=1}^n x_{i,q}^2 \Big) \beta$$

$$= -X^\top Y + X^\top X \beta + \frac{\delta n}{1-\delta}\, \beta \overset{!}{=} 0,$$

where we have used the normalization of the columns of the design matrix $X \in
\mathbb{R}^{n \times q}$ (we drop the intercept column). This is ridge regression in the linear Gaussian
case with a regularization parameter $\lambda = \delta/(2(1-\delta)) > 0$ for $\delta \in (0, 1)$, see (6.9).
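The averaged score equation can be checked numerically. The following R sketch is only an illustration on simulated toy data (all object names are our own assumptions, not part of the listings): it enumerates all drop-out patterns of a row, so the expectation over the drop-out variables is computed exactly, and it verifies that the ridge-type solution of the last display sets the averaged gradient to zero.

```r
# Exact check (no simulation noise in the expectation) that the averaged
# drop-out score equation is solved by the ridge-type estimator.
set.seed(1)
n <- 50; q <- 3; delta <- 0.2

# Standardize: column means 0 and (1/n) * sum of squares equal to 1
X <- scale(matrix(rnorm(n * q), n, q)) * sqrt(n / (n - 1))
Y <- rnorm(n); Y <- Y - mean(Y)

# All 2^q drop-out patterns of one row and their probabilities
masks <- as.matrix(expand.grid(rep(list(c(0, 1)), q)))
prob  <- apply(masks, 1, function(m) prod(ifelse(m == 1, 1 - delta, delta)))

# Expected gradient of the drop-out square loss, averaged over all patterns
expected_gradient <- function(beta) {
  g <- numeric(q)
  for (i in 1:n) {
    for (k in 1:nrow(masks)) {
      I <- masks[k, ] / (1 - delta)        # scaled so that E[I] = 1
      r <- Y[i] - sum(beta * I * X[i, ])   # residual under this pattern
      g <- g - prob[k] * r * I * X[i, ]
    }
  }
  g
}

# Solution of -X'Y + X'X beta + delta * n / (1 - delta) * beta = 0
beta_ridge <- as.vector(solve(crossprod(X) + delta * n / (1 - delta) * diag(q),
                              crossprod(X, Y)))
max_abs_score <- max(abs(expected_gradient(beta_ridge)))
```

The quantity max_abs_score should be zero up to floating point error, which reflects that averaging over the drop-out variables only adds the diagonal penalty term to the ordinary least squares score.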


Normalization Layers 


In (7.29) and (7.30) we have discussed that the continuous feature components
should be pre-processed so that all components live on the same scale, otherwise the
gradient descent fitting may not be efficient. A similar phenomenon may occur with
the learned representations $z^{(m:1)}(x_i)$ in the FN layers $1 \le m \le d$. In particular, this
is the case if we choose an unbounded activation function $\phi$. For this reason, it can
be advantageous to rescale the components $z_j^{(m:1)}(x_i)$, $1 \le j \le q_m$, in a given FN
layer back to the same scale. To achieve this, a normalization step (7.30) is applied
to every neuron $z_j^{(m:1)}(x_i)$ over the given cases $i$ in the considered (mini-)batch. This
involves two more parameters (for the empirical mean and the empirical standard
deviation) in each neuron of the corresponding FN layer. Note, however, that all
these operations are of a linear nature. Therefore, they do not affect the predictive
model (i.e., these operations cancel in the scalar products in (7.6)), but they may
improve the performance of the gradient descent algorithm.

The code in Listing 7.6 uses a normalization layer on line 6. In our applications,
it has not been necessary to use these normalization layers, as it has not led to better
run times in SGD algorithms; note that our networks are not very deep and they use
the symmetric and bounded hyperbolic tangent activation function.
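The statement that the normalization cancels in the scalar products can be illustrated as follows. This is a minimal R sketch on toy numbers, assuming fixed batch statistics; the activations Z and the weights w0, w are hypothetical. It shows that the scalar product computed on normalized neurons coincides with a scalar product on the raw neurons after re-parametrizing the weights.

```r
set.seed(2)
n <- 8; q <- 4
Z <- matrix(rnorm(n * q, mean = 3, sd = 2), n, q)  # neuron values on a mini-batch (toy data)

m <- colMeans(Z)                                   # empirical mean per neuron
s <- apply(Z, 2, sd)                               # empirical standard deviation per neuron
Z_norm <- sweep(sweep(Z, 2, m, "-"), 2, s, "/")    # normalized neurons

# Scalar product of the next layer on the normalized neurons (hypothetical weights)
w0 <- 0.5; w <- c(1, -2, 0.3, 0.7)
pre_norm <- w0 + Z_norm %*% w

# Same scalar product on the raw neurons with re-parametrized weights
w_raw   <- w / s
w0_raw  <- w0 - sum(w * m / s)
pre_raw <- w0_raw + Z %*% w_raw

max_diff <- max(abs(pre_norm - pre_raw))
```

The two versions agree up to floating point error, i.e., the normalization only changes the parametrization that gradient descent works on, not the family of regression functions.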


7.4.2 The Balance Property in Neural Networks 


We have seen in Table 7.4 that our FN network outperforms the GLM for claim 
frequency prediction in terms of a lower out-of-sample loss. We interpret this as 
follows. Feature engineering has not been done in an optimal way for the
GLM because the FN network finds modeling structure that is not present in the 
selected GLM. As a consequence, the FN network provides a better generalization 
to unseen data, i.e., we can better predict new data on a granular level with the FN 
network. However, having a more precise model on an individual policy level does 
not necessarily imply that the model also performs better on a global portfolio level. 
In our example we see that we may have smaller errors on an individual policy level, 
but these smaller errors do not aggregate to a more precise model in the average 
portfolio frequency. In our case, we have a misspecification of the average portfolio 
frequency, see the last column of Table 7.4. This is a major deficiency in insurance 
pricing because it may result in a misspecification of the overall price level, and this 
requires a correction. We call this correction bias regularization. 


Simple Bias Regularization 


The straightforward correction is to adjust the intercept parameter $\beta_0 \in \mathbb{R}$
accordingly. That is, compare the empirical mean

$$\bar\mu = \frac{\sum_{i=1}^n v_i Y_i}{\sum_{i=1}^n v_i},$$

to the model average of the fitted FN network

$$\hat\mu = \frac{\sum_{i=1}^n v_i \mu_{\hat\vartheta}(x_i)}{\sum_{i=1}^n v_i},$$

where $\hat\vartheta = (\hat w_1^{(1)}, \ldots, \hat w_{q_d}^{(d)}, \hat\beta)^\top \in \mathbb{R}^r$ is the learned network parameter from the
(early stopped) SGD algorithm. The output of this fitted model reads as

$$x_i \mapsto \mu_{\hat\vartheta}(x_i) = g^{-1} \langle \hat\beta, \hat z^{(d:1)}(x_i) \rangle = g^{-1} \Big( \hat\beta_0 + \sum_{j=1}^{q_d} \hat\beta_j \hat z_j^{(d:1)}(x_i) \Big),$$

where the hat in $\hat z^{(d:1)}$ indicates that we use the estimated weights $\hat w_l^{(m)}$, $1 \le l \le q_m$,
$1 \le m \le d$, in the FN layers. The balance property can be rectified by replacing $\hat\beta_0$
by the solution $\tilde\beta_0$ of the following identity

$$\sum_{i=1}^n v_i Y_i \overset{!}{=} \sum_{i=1}^n v_i\, g^{-1} \Big( \tilde\beta_0 + \sum_{j=1}^{q_d} \hat\beta_j \hat z_j^{(d:1)}(x_i) \Big).$$

Since $g^{-1}$ is continuous and strictly monotone, there is a unique solution to this
requirement provided that the range of $g^{-1}$ covers the support of the $Y_i$'s. If we
work with the log-link $g(\cdot) = \log(\cdot)$, this can easily be solved and we obtain

$$\tilde\beta_0 = \hat\beta_0 + \log \left( \frac{\bar\mu}{\hat\mu} \right).$$
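Under the log-link this intercept shift is a one-line computation. The following R sketch uses simulated toy data; the vector eta is only a stand-in for the output activations of a fitted network, and all names are illustrative assumptions.

```r
set.seed(3)
n <- 1000
v   <- runif(n, 0.1, 1)                      # exposures (toy data)
eta <- rnorm(n, mean = -2.6, sd = 0.3)       # stand-in for the fitted output activations
Y   <- rpois(n, v * exp(eta + 0.1)) / v      # observed frequencies on a higher level

mu_fit <- exp(eta)                           # fitted means under the log-link
mu_bar <- sum(v * Y) / sum(v)                # empirical mean
mu_hat <- sum(v * mu_fit) / sum(v)           # model average
shift  <- log(mu_bar / mu_hat)               # correction added to the intercept

mu_corr <- exp(eta + shift)                  # bias regularized means
balance_gap <- sum(v * mu_corr) - sum(v * Y)
```

After the shift the exposure-weighted model average coincides with the empirical mean, so balance_gap vanishes up to floating point error; the relative differences between the policies are left unchanged.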


Sophisticated Bias Regularization Under the Canonical Link Choice 


If we work with the canonical link $g = h = (\kappa')^{-1}$, we can do better because the
MLE of such a GLM automatically provides the balance property, see Corollary 5.7.
Choose the SGD learned network parameter $\hat\vartheta = (\hat w_1^{(1)}, \ldots, \hat w_{q_d}^{(d)}, \hat\beta)^\top \in \mathbb{R}^r$.
Denote by $\hat z^{(d:1)}$ the fitted network architecture that is based on the estimated
weights $\hat w_1^{(1)}, \ldots, \hat w_{q_d}^{(d)}$. This allows us to study the learned representations of the
raw features $x_1, \ldots, x_n$ in the last FN layer. We denote these learned representations
by

$$\hat z_1 = \hat z^{(d:1)}(x_1), \; \ldots, \; \hat z_n = \hat z^{(d:1)}(x_n) \in \{1\} \times \mathbb{R}^{q_d}. \tag{7.32}$$

These learned representations can be used as new features to explain the response
Y. We define the feature engineered design matrix by

$$\widehat{X} = (\hat z_1, \ldots, \hat z_n)^\top \in \mathbb{R}^{n \times (q_d+1)}.$$

Based on this new design matrix $\widehat{X}$ we can run a classical GLM receiving a unique
MLE $\hat\beta^{\rm MLE} \in \mathbb{R}^{q_d+1}$ provided that this design matrix has a full rank $q_d + 1 \le n$,
see Proposition 5.1. Since we work with the canonical link, this re-calibrated FN
network will automatically satisfy the balance property, and the resulting regression
function reads as

$$x \mapsto \hat\mu(x) = h^{-1} \langle \hat\beta^{\rm MLE}, \hat z^{(d:1)}(x) \rangle. \tag{7.33}$$

This is the proposal of Wüthrich [390]. We give some remarks.
Remarks 7.11 


• This additional MLE step for the output parameter $\beta \in \mathbb{R}^{q_d+1}$ may lead to
over-fitting. In that case one might choose a lower dimensional last FN layer.
Alternatively, one might explore an earlier stopping rule in SGD.

• Wüthrich [390] also explores other bias correction methods like regularization
using shrinkage. In combination with regression trees one can achieve averages
on pre-defined sub-portfolios. We will not further explore these other approaches
because they are less robust and more difficult in the applications.


Example 7.12 (Balance Property in Networks) We apply this additional MLE step 
to the two FN networks of Table 7.4. Note that in these two examples we consider 
a Poisson model using the canonical link for g, thus, the resulting adjusted 
network (7.33) will automatically satisfy the balance property, see Corollary 5.7. 


Listing 7.7 Balance property adjustment (7.33) 


glm.formula <- function(nn){
    string <- "yy ~ X1"
    if (nn>1){for (ll in 2:nn){ string <- paste(string, "+X", ll, sep="") }}
    string
}
# learned representations (7.32) in the last FN layer
zz <- keras_model(inputs=model$input,
                  outputs=get_layer(model, 'FNLayer3')$output)
xx.learn <- data.frame(zz %>% predict(list(Xlearn, Vlearn)))
q3 <- ncol(xx.learn)
xx.learn$yy <- Ylearn
xx.learn$Exposure <- learn$Exposure
# Poisson GLM on the learned representations (canonical link, exposure as offset)
glm1 <- glm(as.formula(glm.formula(q3)),
            data=xx.learn, offset=log(Exposure), family=poisson())
# impute the MLE into the output layer of the network
w1 <- get_weights(model)
w1[[7]] <- array(glm1$coefficients[2:(q3+1)], dim=c(q3,1))
w1[[8]] <- array(glm1$coefficients[1], dim=c(1))
set_weights(model, w1)


In Listing 7.7 we illustrate the necessary code that has to be added to Listings 7.1-7.3.
On lines 7-8 of Listing 7.7 we retrieve the learned representations (7.32) which are
used as the new features in the Poisson GLM on lines 13-14. The resulting MLE
$\hat\beta^{\rm MLE} \in \mathbb{R}^{q_d+1}$ is imputed to the network parameter $\hat\vartheta$ on lines 17-20. Table 7.5
shows the performance of the resulting bias regularized FN networks.

Table 7.5 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5 and the FN network models (with one-hot encoding and embedding layers of dimension
b = 2, respectively), and their bias regularized counterparts

                                        | Run time | # param. | In-sample loss on L | Out-of-sample loss on T | Aver. freq.
Poisson null                            | -        | 1        | 25.213              | 25.445                  | 7.36%
Poisson GLM3                            | 15s      | 50       | 24.084              | 24.102                  | 7.36%
One-hot FN (q1, q2, q3) = (20, 15, 10)  | 51s      | 1'306    | 23.757              | 23.885                  | 6.96%
Embed FN (q1, q2, q3) = (20, 15, 10)    | 120s     | 792      | 23.694              | 23.820                  | 7.24%
One-hot FN bias regularized             | +4s      | 1'306    | 23.742              | 23.878                  | 7.36%
Embed FN bias regularized               | +4s      | 792      | 23.690              | 23.824                  | 7.36%

Firstly, we observe from the last column of Table 7.5 that, indeed, the bias
regularization step (7.33) provides the balance property. In general, in-sample losses
(have to) decrease because $\hat\beta^{\rm MLE}$ is (in-sample) more optimal than the early stopped
SGD solution $\hat\beta$. Out-of-sample this leads to a small improvement in the one-hot
encoded variant and a small worsening in the embedding variant, i.e., the
latter slightly over-fits in this additional MLE step. However, these differences are
comparably small so that we do not further worry about the over-fitting, here. This
closes this example. ■
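The mechanics of the additional MLE step in Example 7.12 can be mimicked without a fitted keras model. The following R sketch is a toy illustration under our own assumptions: the matrix Z is a simulated stand-in for the learned representations (7.32), and beta_sgd plays the role of an early stopped output parameter. It only demonstrates that the Poisson GLM with canonical link and intercept restores the balance property.

```r
set.seed(4)
n <- 2000; q3 <- 3
Z <- matrix(tanh(rnorm(n * q3)), n, q3)      # stand-in for learned representations
colnames(Z) <- paste0("X", 1:q3)
dat <- data.frame(Z)
dat$Exposure <- runif(n, 0.1, 1)
dat$N <- rpois(n, dat$Exposure * exp(-2.5 + 0.4 * dat$X1 - 0.3 * dat$X2))

# Hypothetical early stopped output weights (deliberately off the MLE)
beta_sgd <- c(-2.7, 0.3, -0.2, 0.1)
mu_sgd <- dat$Exposure * exp(beta_sgd[1] + Z %*% beta_sgd[-1])

# Additional MLE step: Poisson GLM with canonical log-link and exposure offset
fit <- glm(N ~ X1 + X2 + X3 + offset(log(Exposure)), data = dat,
           family = poisson(), control = glm.control(epsilon = 1e-10))
mu_mle <- fitted(fit)

gap_sgd <- sum(mu_sgd) - sum(dat$N)          # balance gap before the MLE step
gap_mle <- sum(mu_mle) - sum(dat$N)          # balance gap after the MLE step
```

The gap after the MLE step is zero up to the convergence tolerance of the fitting algorithm, whereas the early stopped parameter generally leaves a non-zero gap.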


Auto-Calibration for Bias Regularization 


We present another approach of correcting for the potential failure of the balance 
property. This method does not depend on a particular type of regression model, 
i.e., it can be applied to any regression model. This proposal goes back to Denuit et 
al. [97], and it is based on the notion of auto-calibration introduced by Patton [297] 
and Krüger-Ziegel [227]. We first describe auto-calibration and its implications.


Definition 7.13 The random variable Z is an auto-calibrated forecast of random 
variable Y if E[Y |Z] = Z, a.s. 


If the response Y is described by the features X = x, we consider the conditional
mean of Y, given X,

$$\mu(X) = E[Y | X].$$

This conditional mean $\mu(X)$ is an auto-calibrated forecast for the response Y. Use
the tower property and note that $\sigma(\mu(X)) \subset \sigma(X)$ to receive, a.s.,

$$E[Y | \mu(X)] = E\big[ E[Y | X] \,\big|\, \mu(X) \big] = E[\mu(X) | \mu(X)] = \mu(X).$$


For the further understanding of auto-calibration and forecast dominance, we
introduce the concept of convex order; forecast dominance has been introduced in
Definition 4.20.

Definition 7.14 (Convex Order) A random variable $Z_1$ is bigger in convex order
than a random variable $Z_2$, write $Z_1 \ge_{\rm cx} Z_2$, if $E[\Psi(Z_1)] \ge E[\Psi(Z_2)]$, for all
convex functions $\Psi$ for which the expectations exist.


By Strassen's theorem, $Z_1 \ge_{\rm cx} Z_2$ if and only if there exist random variables
$Z_1'$ and $Z_2'$ with $Z_1 \overset{(d)}{=} Z_1'$ and $Z_2 \overset{(d)}{=} Z_2'$, and $E[Z_1' | Z_2'] = Z_2'$, a.s. In particular,
the convex order $Z_1 \ge_{\rm cx} Z_2$ implies that ${\rm Var}(Z_1) \ge {\rm Var}(Z_2)$ and $E[Z_1] = E[Z_2]$.
The latter follows from Strassen's theorem and the tower property, and the former
follows from the latter and the convex order by using the explicit choice $\Psi(x) = x^2$.
Thus, the random variable $Z_1$ is more volatile than $Z_2$, both having the same mean.
The following theorem shows that this additional volatility is a favorable property
in terms of forecast dominance under auto-calibration.


Theorem 7.15 (Krüger-Ziegel [227, Theorem 3.1], Without Proof) Assume that
$\hat\mu_1$ and $\hat\mu_2$ are auto-calibrated forecasts for the random variable Y. Predictor $\hat\mu_1$
forecast dominates $\hat\mu_2$ if and only if $\hat\mu_1 \ge_{\rm cx} \hat\mu_2$.


Recall that forecast dominance of $\hat\mu_1$ over $\hat\mu_2$ was defined as follows, see
Definition 4.20,

$$E\left[ D_\psi(Y, \hat\mu_1) \right] \le E\left[ D_\psi(Y, \hat\mu_2) \right],$$

for all Bregman divergences $D_\psi$. Strassen's theorem tells us that $\hat\mu_1$ is more volatile
than $\hat\mu_2$ (both being auto-calibrated and unbiased for E[Y]) and this additional
volatility implies that the former auto-calibrated predictor can better follow Y. This
provides the superior forecast dominance of $\hat\mu_1$ over $\hat\mu_2$. This relation is most easily
understood by the following example. Consider (Y, X) as above. Assume that the
feature $\widetilde X$ is a sub-variable of the feature X by dropping some of the components
of X. Naturally, we have $\sigma(\widetilde X) \subset \sigma(X)$, and both sets of information provide
auto-calibrated forecasts

$$\mu(X) = E[Y | X] \qquad \text{and} \qquad \mu(\widetilde X) = E[Y | \widetilde X].$$

The tower property and Jensen's inequality give for any convex function $\Psi$ (subject
to existence)

$$E[\Psi(\mu(X))] = E\big[ \Psi(E[Y | X]) \big] = E\Big[ E\big[ \Psi(E[Y | X]) \,\big|\, \widetilde X \big] \Big]$$

$$\ge E\Big[ \Psi\big( E\big[ E[Y | X] \,\big|\, \widetilde X \big] \big) \Big] = E\big[ \Psi(E[Y | \widetilde X]) \big] = E\big[ \Psi(\mu(\widetilde X)) \big].$$

Thus, we have $\mu(X) \ge_{\rm cx} \mu(\widetilde X)$ which implies forecast dominance of $\mu(X)$ over
$\mu(\widetilde X)$. This makes perfect sense in view of $\sigma(\widetilde X) \subset \sigma(X)$. Basically, this describes
the construction of an $\mathbb{F}$-martingale using an integrable random variable Y and a
filtration $\mathbb{F}$ on the underlying probability space $(\Omega, \mathcal{A}, P)$. This martingale sequence
provides forecast dominance with increasing information sets described by the
filtration $\mathbb{F}$.
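The convex order between the two information sets can be verified exactly on a small discrete example. The R sketch below is purely illustrative (the distribution of the features and the values of the conditional mean are our own toy choices): it computes the conditional means under the full information and after dropping one component, and compares the expectations for a few convex functions.

```r
# Feature X = (X1, X2) uniform on {0,1}^2 (toy example)
grid <- expand.grid(X1 = c(0, 1), X2 = c(0, 1))
p    <- rep(0.25, 4)
mu_full <- 0.05 + 0.04 * grid$X1 + 0.06 * grid$X2   # mu(X) = E[Y | X1, X2]

# Coarser information: drop X2; by the tower property E[Y | X1] = E[mu(X) | X1]
mu_sub <- ave(mu_full * p, grid$X1, FUN = sum) / ave(p, grid$X1, FUN = sum)

# Expectation of f(m) under the feature distribution
expect <- function(f, m) sum(p * f(m))

# A few convex test functions
psi_list <- list(square = function(x) x^2,
                 expo   = function(x) exp(10 * x),
                 hinge  = function(x) pmax(x - 0.08, 0))

lhs <- sapply(psi_list, expect, m = mu_full)   # E[Psi(mu(X))]
rhs <- sapply(psi_list, expect, m = mu_sub)    # E[Psi(mu(X_tilde))]
```

Both forecasts have the same mean 0.10, and every entry of lhs is at least as big as the corresponding entry of rhs, in line with Jensen's inequality.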
We now turn our attention to the balance property and the unbiasedness of
predictors, this follows Denuit et al. [97]. Assume we have any predictor $\hat\mu(x)$ of
Y, for instance, this can be any FN network predictor $\mu_{\hat\vartheta}(x)$ coming from an early
stopped SGD algorithm. We define its balance-corrected version by

$$\hat\mu_{\rm BC}(x) = E\left[ Y \,\middle|\, \hat\mu(x) \right]. \tag{7.34}$$

Proposition 7.16 (Wüthrich [391, Proposition 4.6], Without Proof) The
balance-corrected predictor $\hat\mu_{\rm BC}(X)$ is an auto-calibrated forecast for Y.


Remarks 7.17 (Expected Deviance Generalization Loss) We return to the
decomposition of the expected deviance GL given in Theorem 4.7, but we add the features
X = x, now. The expected deviance GL of a predictor $\hat\mu(X)$ under the unit deviance
$\mathfrak{d}$ then reads as

$$E_\theta\left[ \mathfrak{d}(Y, \hat\mu(X)) \right] = E_\theta\left[ \mathfrak{d}(Y, \mu) \right] + 2 \Big( \mu h(\mu) - \kappa(h(\mu)) - E_\theta\left[ Y h(\hat\mu(X)) \right] + E_\theta\left[ \kappa(h(\hat\mu(X))) \right] \Big),$$

where $\mu = E_\theta[Y]$ is the unconditional mean of Y (averaging also over the feature
distribution of X). Note that this formula differs from (4.13) because Y and $h(\hat\mu(X))$
are no longer independent if we include the features X. The term $E_\theta[\mathfrak{d}(Y, \mu)]$ is
called the entropy which is driven by the stochastic nature of the random variable
Y. This is the irreducible risk if no feature information is available.

In statistical modeling one considers different decompositions of the expected
deviance GL, we refer to Fissler et al. [129]. Namely, introducing the features X
we can reduce the expected deviance GL compared to the unconditional mean $\mu$ in
terms of forecast dominance. This allows us to decouple as follows for the prediction
$\mu(X) = E_\theta[Y | X]$

$$E_\theta\left[ \mathfrak{d}(Y, \hat\mu(X)) \right] = E_\theta\left[ \mathfrak{d}(Y, \mu) \right] - \Big( E_\theta\left[ \mathfrak{d}(Y, \mu) \right] - E_\theta\left[ \mathfrak{d}(Y, \mu(X)) \right] \Big)$$

$$+ \Big( E_\theta\left[ \mathfrak{d}(Y, \hat\mu(X)) \right] - E_\theta\left[ \mathfrak{d}(Y, \mu(X)) \right] \Big).$$

This expresses the expected deviance GL of the predictor $\hat\mu(X)$ as the entropy (first
term), the conditional resolution (second term) and the conditional calibration (third
term). The conditional resolution describes the information gain in terms of forecast
dominance knowing the feature X, and the conditional calibration describes how
well we estimate $\mu(X)$. The conditional resolution is positive because $\mu(X) \ge_{\rm cx} \mu$
and the unit deviance $\mathfrak{d}(Y, \cdot)$ is a convex function, see Lemma 2.22. The conditional
calibration is also positive, this can be seen by considering the deviance GL,
conditional on X.

We can reformulate this expected deviance GL in terms of the auto-calibration
property

$$E_\theta\left[ \mathfrak{d}(Y, \hat\mu(X)) \right] = E_\theta\left[ \mathfrak{d}(Y, \mu) \right] - \Big( E_\theta\left[ \mathfrak{d}(Y, \mu) \right] - E_\theta\left[ \mathfrak{d}(Y, \hat\mu_{\rm BC}(X)) \right] \Big)$$

$$+ \Big( E_\theta\left[ \mathfrak{d}(Y, \hat\mu(X)) \right] - E_\theta\left[ \mathfrak{d}(Y, \hat\mu_{\rm BC}(X)) \right] \Big).$$

The first term is the entropy, the second term is called the auto-resolution and the
third term describes the auto-calibration. If we have an auto-calibrated forecast
$\hat\mu(X)$ then the last term vanishes because it is equal to its balance-corrected version
$\hat\mu_{\rm BC}(X)$. Again these two latter terms are positive, for the auto-calibration this can
be seen by considering the deviance GL, conditioned on $\hat\mu(X)$.


To rectify the balance property we directly focus on (7.34), and we estimate
this conditional expectation. That is, the balance correction can be achieved by an
additional regression step directly estimating the balance-corrected version $\hat\mu_{\rm BC}(x)$
in (7.34). This additional regression step differs from (7.33) as it does not use the
learned representations $\hat z^{(d:1)}(x)$ in the last FN layer (7.32), but it uses the learned
representations in the output layer. That is, consider the learned features

$$\tilde z_1 = (1, \mu_{\hat\vartheta}(x_1))^\top, \; \ldots, \; \tilde z_n = (1, \mu_{\hat\vartheta}(x_n))^\top \in \{1\} \times \mathbb{R},$$

and perform an additional linear regression step for the response Y using the design
matrix

$$\widetilde X = (\tilde z_1, \ldots, \tilde z_n)^\top \in \mathbb{R}^{n \times 2}.$$

This additional linear regression step gives us an estimate

$$\hat\beta = \big( \widetilde X^\top V \widetilde X \big)^{-1} \widetilde X^\top V Y \in \mathbb{R}^2, \tag{7.35}$$

with diagonal weight matrix $V = {\rm diag}(v_i)_{1 \le i \le n}$. The balance property is then
restored by estimating the balance-corrected means $\hat\mu_{\rm BC}(x_i)$ by

$$\hat\mu_{\rm BC}(x_i) = \hat\beta_0 + \hat\beta_1 \mu_{\hat\vartheta}(x_i), \tag{7.36}$$

for $1 \le i \le n$. Note that this can be done for any regression model since we do not
rely on the network architecture in this step.
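The weighted least squares step (7.35)-(7.36) needs only a few lines. The R sketch below uses simulated toy data; mu_hat is a stand-in for any first-stage predictor and is not taken from the examples of this chapter. It evaluates the closed-form estimate, cross-checks it with lm() using the exposures as weights, and computes the remaining balance gap.

```r
set.seed(6)
n <- 1000
v      <- runif(n, 0.1, 1)                   # exposures (toy data)
mu_hat <- exp(rnorm(n, -2.6, 0.3))           # stand-in for the first-stage predictor
Y      <- rpois(n, v * 1.1 * mu_hat) / v     # frequencies; predictor is too low on average

Xt <- cbind(1, mu_hat)                       # design matrix with rows (1, mu_hat(x_i))
V  <- diag(v)                                # diagonal weight matrix
beta_hat <- solve(t(Xt) %*% V %*% Xt, t(Xt) %*% V %*% Y)   # estimate (7.35)
mu_BC    <- as.vector(Xt %*% beta_hat)                     # corrected means (7.36)

# Cross-check with lm() and exposure weights
beta_lm <- coef(lm(Y ~ mu_hat, weights = v))

balance_gap <- sum(v * mu_BC) - sum(v * Y)
```

The first normal equation of the weighted least squares problem is exactly the balance property on the exposure scale, which is why balance_gap vanishes up to floating point error.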
Remarks 7.18


• Balance correction (7.36) may lead to some conflict in range if the dual (mean)
parameter space $\mathcal{M}$ is (one-sided) bounded. Moreover, it does not consider the
deviance loss of the response Y, but it rather underlies a Gaussian model by
using the weighted square loss function for finding (the Gaussian MLE) $\hat\beta \in \mathbb{R}^2$.
Alternatively, we could consider the canonical link h that belongs to the chosen
EDF. This then allows us to study the regression problem on the canonical scale
by setting for the learned representations

$$\tilde z_1^\theta = (1, h(\mu_{\hat\vartheta}(x_1)))^\top, \; \ldots, \; \tilde z_n^\theta = (1, h(\mu_{\hat\vartheta}(x_n)))^\top \in \{1\} \times \Theta. \tag{7.37}$$

The latter motivates the consideration of a GLM under the chosen EDF

$$x_i \mapsto h(\hat\mu_{\rm BC}(x_i)) = \langle \beta, \tilde z_i^\theta \rangle = \beta_0 + \beta_1 h(\mu_{\hat\vartheta}(x_i)), \tag{7.38}$$

for regression parameter $\beta \in \mathbb{R}^2$. The choice of the canonical link and the
inclusion of an intercept will provide the balance property when estimating $\beta$
with MLE, see Corollary 5.7. If the mean estimates $\mu_{\hat\vartheta}(x_i)$ involve the canonical
link h, (7.38) reads as

$$x_i \mapsto h(\hat\mu_{\rm BC}(x_i)) = \langle \beta, \tilde z_i^\theta \rangle = \beta_0 + \beta_1 \langle \hat\beta, \hat z^{(d:1)}(x_i) \rangle,$$

the latter scalar product is the output activation received from the FN network.
From this we see that the estimated balance-corrected calibration on the
canonical scale will give us a non-optimal (in-sample) estimation step compared
to (7.33), if we work with the canonical link h.

• Denuit et al. [97] give a proposal to break down the global balance to a local
version using a suitable kernel function, this will be further discussed in the next
Example 7.19.
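The canonical-scale variant (7.38) can be sketched for the Poisson case, where the canonical link is the log-link. The R code below again works on simulated toy data with a stand-in predictor mu_hat (an assumption for illustration only); the exposures enter as an offset.

```r
set.seed(7)
n <- 1000
v      <- runif(n, 0.1, 1)                   # exposures (toy data)
mu_hat <- exp(rnorm(n, -2.6, 0.3))           # stand-in for the first-stage predictor
N      <- rpois(n, v * 1.1 * mu_hat)         # claim counts

# GLM (7.38) on the canonical scale: regress on h(mu_hat) = log(mu_hat)
fit <- glm(N ~ log(mu_hat) + offset(log(v)), family = poisson(),
           control = glm.control(epsilon = 1e-10))
mu_BC <- fitted(fit) / v                     # balance-corrected frequencies

balance_gap <- sum(v * mu_BC) - sum(N)
```

In contrast to the Gaussian correction (7.36), the corrected means stay strictly positive by construction, and the balance property follows from the canonical link together with the intercept.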


Example 7.19 (Auto-calibration in Networks) We apply this additional
auto-calibration step (7.34) to the FN network with embedding layers that does not
satisfy the balance property, i.e., having an average frequency of 7.24% < 7.36%,
see Tables 7.4 and 7.5. We start by analyzing the auto-calibration property (7.34) of
this network predictor $\mu_{\hat\vartheta}(x)$ by studying an empirical version of

$$z \mapsto v \hat\mu_{\rm BC}(x) = E\left[ vY \,\middle|\, v \mu_{\hat\vartheta}(x) = z \right]. \tag{7.39}$$

This empirical version is obtained from the R library locfit [254] that allows us
to consider a local polynomial regression fit of degree deg=2, and we use a nearest
neighbor fraction of alpha=0.05, the code is provided in Listing 7.8. We use the
exposure v scaled version in (7.39) since the balance property should hold on that
scale, see Corollary 5.7. The claim counts are given by N = vY, and the exposure
v is integrated as an offset into the FN network regression function, see line 20 of
Listing 7.4.


Listing 7.8 Empirical auto-calibration using the R library locfit [254] 


library(locfit)
z <- learn$pred
mu.BC <- predict(locfit(learn$N ~ learn$pred, alpha=0.05, deg=2), newdata=z)


Figure 7.12 (lhs) shows the empirical auto-calibration of (7.39) using the R
code of Listing 7.8. If the auto-calibration would hold exactly, then the black
dots should lie on the red diagonal line. We observe a very good match, which
indicates that the auto-calibration property holds quite accurately for our network
predictor $(v, x) \mapsto v \mu_{\hat\vartheta}(x)$. For very small expectations $E_{\theta(x)}[N]$ we slightly
underestimate, and for bigger expectations we slightly overestimate. The blue line
shows the empirical density of the predictors $v_i \mu_{\hat\vartheta}(x_i)$, $1 \le i \le n$, highlighting
heavy-tailedness and that the underestimation in the right tail will not substantially
contribute to the balance property as these are only very few insurance policies.

We explore the Gaussian balance correction (7.35) considering a linear regression
model with weighted square loss function. We receive the estimate $\hat\beta = (9 \cdot
10^{-4}, 1.005)^\top$, thus, $\mu_{\hat\vartheta}(\cdot)$ only gets very gently distorted, see (7.36). The results of
this balance-corrected version $\hat\mu_{\rm BC}(x)$ are given on line 'Embed FN Gauss
balance-corrected' in Table 7.6. We observe that this approach is rather competitive leading
to a slightly better model (out-of-sample). Figure 7.12 (rhs) shows the resulting
(empirical) auto-calibration plot which is still not fully in line with Proposition 7.16;
this empirical plot may be distorted by the exposures, by the fact that it is an
empirical plot fitted with locfit, and by the fact that a linear Gaussian correction
estimate may not be fully suitable.

Fig. 7.12 (lhs) Empirical auto-calibration (7.39), the blue line shows the empirical density of the
predictors $v_i \mu_{\hat\vartheta}(x_i)$, $1 \le i \le n$; (rhs) balance-corrected version using the weighted Gaussian
correction (7.35)

Table 7.6 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3
of Table 5.5, the FN network model (with embedding layers of dimension b = 2), and their bias
regularized and balance-corrected counterparts, the local correction uses a GAM with 2.6 degrees
of freedom in the cubic spline part

                                        | Run time | # param.  | In-sample loss on L | Out-of-sample loss on T | Aver. freq.
Poisson null                            | -        | 1         | 25.213              | 25.445                  | 7.36%
Poisson GLM3                            | 15s      | 50        | 24.084              | 24.102                  | 7.36%
Embed FN (q1, q2, q3) = (20, 15, 10)    | 120s     | 792       | 23.694              | 23.820                  | 7.24%
Embed FN bias regularized               | +4s      | 792       | 23.690              | 23.824                  | 7.36%
Embed FN Gauss balance-corrected        | -        | 792 + 2   | 23.692              | 23.819                  | 7.36%
Embed FN locally balance-corrected      | -        | 792 + 3.6 | 23.692              | 23.818                  | 7.36%

Denuit et al. [97] propose a local balance correction that is very much in the
spirit of the local polynomial regression fit with locfit. However, when using
locfit we did not pay any attention to the balance property. Therefore, we
proceed slightly differently, here. In formula (7.37) we give the network predictors
on the canonical scale. This equips us with the data $(Y_i, v_i, \tilde z_i^\theta)_{1 \le i \le n}$. To perform
a local balance correction we fit a generalized additive model (GAM) to this data,
using the canonical link, the Poisson deviance loss function, the observations $Y_i$,
the exposures $v_i$ and the feature information $\tilde z_i^\theta$; for GAMs we refer to
Hastie-Tibshirani [181, 182], Wood [384] and Chapter 3 in Wüthrich-Buser [392], in
particular, we proceed as in Example 3.4 of the latter reference.

The GAM regression fit on the canonical scale is illustrated in Fig. 7.13 (lhs).
We essentially receive a straight line which says that the auto-calibration property is
already well satisfied by the FN network predictor $\mu_{\hat\vartheta}$. In fact, it is not completely
a straight line, but GCV provides an optimal model with 2.6 effective degrees of
freedom in the natural cubic spline part. This local (GAM) balance correction leads
to another small model improvement (out-of-sample), see last line of Table 7.6.


Conclusion The balance property adjustment and the bias regularization are crucial
in ensuring that the predictive model is on the right (price) level. We have presented
three sophisticated methods of balance property adjustments: the additional
GLM step under the canonical link choice (7.33), the model-free global Gaussian
correction (7.35)-(7.36), and the local balance correction using a GAM under the
canonical link choice. In our example, the results of the three different approaches
are rather similar. In the sequel, we use the additional GLM step solution (7.33), the
reason being that under this approach we can rely on one single regression model
that directly predicts the claims. The other two approaches need two steps to get the
predictions, which requires the storage of two models. ■

Fig. 7.13 (lhs) GAM fit on the canonical scale having 2.6 effective degrees of freedom (red shows
the estimated confidence bounds); (rhs) balance-corrected version using the local GAM correction

7.4.3 Boosting Regression Models with Network Features 


From Table 7.5 we conclude that the FN networks find systematic structure in the
data that is not present in model Poisson GLM3, thus, the feature engineering for
the GLM can be improved. Unfortunately, FN networks neither directly build on
GLMs nor do they highlight the weaknesses of GLMs. In this section we discuss
a proposal presented in Wüthrich-Merz [394] and Schelldorfer-Wüthrich [329]
of combining two regression approaches. We are going to boost a GLM with FN
network features. Typically, boosting is applied within the framework of regression
trees. It goes back to the work of Valiant [362], Kearns-Valiant [209, 210], Schapire
[328], Freund [139] and Freund-Schapire [140]. The idea behind boosting is to
analyze the residuals of a given regression model with a second regression model
to see whether this second regression model can still find systematic effects in the
residuals which have not been discovered by the first one.

We start from the GLM studied in Chap. 5, and we boost this GLM with a FN
network. Assume that both regression models act on the same feature space $\mathcal{X} \subset
\{1\} \times \mathbb{R}^{q_0}$. The GLM provides a regression function for link function g and GLM
parameter $\beta^{\rm GLM} \in \mathbb{R}^{q_0+1}$

$$x \mapsto \mu^{\rm GLM}(x) = g^{-1} \langle \beta^{\rm GLM}, x \rangle.$$

Recall that this GLM can be interpreted as a FN network of depth 0, see
Remarks 7.2. Next, we choose a FN network of depth $d \ge 1$ with the same link
function g as the GLM

$$x \mapsto \mu^{\rm FN}(x) = g^{-1} \langle \beta^{\rm FN}, z^{(d:1)}(x) \rangle,$$

having a network parameter $\vartheta = (w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta^{\rm FN})^\top \in \mathbb{R}^r$. In particular, we
have the FN output parameter $\beta^{\rm FN} \in \mathbb{R}^{q_d+1}$, we refer to Fig. 7.2.

We blend these two regression models by combining their regression functions

$$x \mapsto \mu(x) = g^{-1} \Big\{ \langle \beta^{\rm GLM}, x \rangle + \langle \beta^{\rm FN}, z^{(d:1)}(x) \rangle \Big\}, \tag{7.40}$$

with parameter $\Phi = (\beta^{\rm GLM}, \vartheta) = (\beta^{\rm GLM}, w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta^{\rm FN})^\top \in
\mathbb{R}^{q_0+1+r}$.

An example is provided in Fig. 7.14. It shows the FN network using embedding
layers for the categorical variables, see also Fig. 7.9 (rhs), and we add a GLM (in
green color) that directly links the input x to the response variable. In machine
learning this green connection is called a skip connection because it skips the FN
layers.


Fig. 7.14 Illustration of the combined regression function (7.40) using a GLM (in a skip
connection) and a FN network


Remarks 7.20


• Skip connections are a popular tool in network modeling, and they can be applied
to any FN layers, i.e., a skip connection can, for instance, be added to skip the
first FN layer. There are two benefits from skip connections. Firstly, they allow
for more modeling flexibility, in (7.40) we directly combine a linear function
(coming from the GLM) with a non-linear one (coming from the FN network).
This has the flavor of a Taylor expansion to combine terms of different orders.
Secondly, skip connections can also be beneficial for gradient descent fitting
because the inputs have a more direct link to the outputs, and the network only
builds the functional form around the function in the skip connection.

• There are numerous variants of (7.40). A straightforward one is to choose a
weight $\alpha \in (0, 1)$ and consider the regression function

$$x \mapsto \mu(x) = g^{-1} \Big\{ \alpha \langle \beta^{\rm GLM}, x \rangle + (1 - \alpha) \langle \beta^{\rm FN}, z^{(d:1)}(x) \rangle \Big\}. \tag{7.41}$$

The weight $\alpha$ can be interpreted as the credibility assigned to the GLM.

• Regression function (7.40) considers two intercepts $\beta_0^{\rm GLM}$ and $\beta_0^{\rm FN}$. If we do not
consider the credibility version (7.41), one of the two intercepts is redundant.

• This approach also allows us to learn systematic effects across different insurance
portfolios. If we have three insurance portfolios living on the same feature space
and if $\chi \in \{1, 2, 3\}$ indicates which insurance portfolio we consider, we can
modify the regression function (7.40) to

$$(x, \chi) \mapsto \mu(x, \chi) = g^{-1} \Big\{ \sum_{j=1}^3 \langle \beta_j^{\rm GLM}, x \rangle \mathbb{1}_{\{\chi = j\}} + \langle \beta^{\rm FN}, z^{(d:1)}(x) \rangle \Big\}.$$

The indicator $\mathbb{1}_{\{\chi = j\}}$ chooses the GLM that belongs to the corresponding
insurance portfolio $\chi \in \{1, 2, 3\}$ with the (individual) GLM parameter $\beta_\chi^{\rm GLM}$.
The FN network term makes them related, i.e., the GLMs of the different
insurance portfolios interact (jointly learn) via the FN network module. This is
the approach used in Gabrielli et al. [149] to improve the chain-ladder reserving
method by learning across different claims reserving triangles.


The regression function (7.40) gives the structural form of the combined
regression model, but there is a second important ingredient proposed by
Wüthrich-Merz [394]. Namely, the gradient descent algorithm (7.15) for model fitting can be
started in an initial network parameter $\Phi^{(0)} \in \mathbb{R}^{q_0+1+r}$ that corresponds to the MLE
of the GLM. Denote by $\hat\beta^{\rm GLM}$ the MLE of the GLM part, only.


Choose the initial value of the gradient descent algorithm for the fitting of the
combined regression model (7.40)

$$\Phi^{(0)} = \Big( \hat\beta^{\rm GLM}, w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta^{\rm FN} \equiv 0 \Big)^\top \in \mathbb{R}^{q_0+1+r}, \tag{7.42}$$

that is, initially, no signals traverse the FN network part because we set $\beta^{\rm FN} \equiv 0$.
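The effect of the initialization (7.42) can be made explicit with a hand-coded toy network. The R sketch below is an illustration under our own assumptions: a single FN layer with hyperbolic tangent activation, a log-link, arbitrary FN layer weights W1, and a hypothetical vector beta_glm in place of a fitted GLM. With zero FN output weights, the combined regression function (7.40) returns exactly the GLM prediction.

```r
set.seed(8)
n <- 5; q0 <- 3; q1 <- 4
X <- cbind(1, matrix(rnorm(n * q0), n, q0))      # features including intercept component
beta_glm <- c(-2.5, 0.2, -0.1, 0.05)             # stand-in for the GLM MLE

W1 <- matrix(rnorm((q0 + 1) * q1), q0 + 1, q1)   # arbitrary FN layer weights
beta_fn <- rep(0, q1 + 1)                        # FN output weights set to zero, see (7.42)

# Combined regression function (7.40) with log-link and one FN layer
combined <- function(X, beta_glm, W1, beta_fn) {
  Z <- cbind(1, tanh(X %*% W1))                  # learned representation with intercept
  as.vector(exp(X %*% beta_glm + Z %*% beta_fn)) # skip connection plus FN part
}

mu_init <- combined(X, beta_glm, W1, beta_fn)    # prediction at the initial value
mu_glm  <- as.vector(exp(X %*% beta_glm))        # pure GLM prediction
```

Since the FN part contributes zero regardless of W1, gradient descent starts in the GLM, and any decrease of the loss during training is attributable to the network features.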
Remarks 7.21


• Using the initialization (7.42), the gradient descent algorithm starts exactly in
the optimal GLM. The algorithm then tries to improve this GLM w.r.t. the given
loss function using the additional FN network features. If the loss substantially
reduces during the gradient descent training, the GLM misses systematic
structure and it can be improved, otherwise the GLM is already good (enough).

• We can declare the MLE $\hat\beta^{\rm GLM}$ to be non-trainable. In that case the original
GLM always remains in the combined regression model and it acts as an offset.

If we declare the MLE $\hat\beta^{\rm GLM}$ to be non-trainable, we could choose a trainable
credibility weight $\alpha \in (0, 1)$, see (7.41), which gradually reduces the influence
of the GLM (if necessary).


Implementation of the general combined regression model (7.40) can be a bit
cumbersome, see Listing 4 in Gabrielli et al. [149], but things can substantially
be simplified by declaring the GLM part in (7.40) as being non-trainable, i.e.,
estimating $\beta^{\rm GLM}$ by $\hat\beta^{\rm GLM}$ in the GLM, and then freeze this parameter. In view
of (7.40) this simply means that we add an offset $o_i = \langle \hat\beta^{\rm GLM}, x_i \rangle$ to the FN
network that is treated as a prior difference between the different data points, we
refer to Sect. 5.2.3.
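The offset idea can be tried out with two GLMs before involving a network. In the R sketch below (simulated toy data, all names are illustrative assumptions) a first Poisson GLM is frozen and enters a second regression step as an offset; for simplicity the log-exposure is merged into the offset. An intercept-only second step finds nothing to correct, whereas a second step with a feature that the first GLM misses reduces the deviance.

```r
set.seed(9)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), v = runif(n, 0.1, 1))
dat$N <- rpois(n, dat$v * exp(-2.5 + 0.3 * dat$x1 + 0.2 * dat$x1 * dat$x2))

ctrl <- glm.control(epsilon = 1e-10)

# First model: GLM without the interaction term
glm0 <- glm(N ~ x1 + x2 + offset(log(v)), data = dat,
            family = poisson(), control = ctrl)

# Frozen GLM as offset: log(v_i) + <beta_GLM, x_i>
dat$o <- log(fitted(glm0))

# Second step with intercept only: the GLM is already balanced
boost0 <- glm(N ~ 1 + offset(o), data = dat, family = poisson(), control = ctrl)

# Second step with the missing interaction as additional feature
boost1 <- glm(N ~ I(x1 * x2) + offset(o), data = dat,
              family = poisson(), control = ctrl)
```

The intercept of boost0 is zero up to numerical tolerance because the first GLM already satisfies the balance property, and the deviance of boost1 cannot exceed the one of glm0. A FN network with the same offset plays the role of boost1 with learned instead of hand-crafted features.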


Example 7.22 (Combined GLM and FN Network) We revisit the French MTPL 
claim frequency GLM of Sect. 5.3.4, and we boost model Poisson GLM3 with FN 
network features. For the FN architecture we use the structure depicted in Fig. 7.14, 
i.e., a FN network of depth d = 3 having (q1, q2, q3) = (20, 15, 10) neurons, and 
using embedding layers of dimension b = 2 for the categorical feature components. 
Moreover, we declare the GLM part to be non-trainable, which allows us to use the
GLM as an offset in the FN network. In addition, we apply bias regularization (7.33)
to obtain the balance property.

Table 7.7 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5, the FN network model (with embedding layers of dimension b = 2), and the combined
regression model GLM3+FN, see (7.40)

                                        | Run time | # param. | In-sample loss on L | Out-of-sample loss on T | Aver. freq.
Poisson null                            | -        | 1        | 25.213              | 25.445                  | 7.36%
Poisson GLM3                            | 15s      | 50       | 24.084              | 24.102                  | 7.36%
Embed FN (q1, q2, q3) = (20, 15, 10)    | 120s     | 792      | 23.694              | 23.820                  | 7.24%
Embed FN bias regularized               | +4s      | 792      | 23.690              | 23.824                  | 7.36%
Combined GLM+FN (20, 15, 10)            | +53s     | 50 + 792 | 23.772              | 23.834                  | 7.24%
Combined GLM+FN bias regularized        | +4s      | 50 + 792 | 23.765              | 23.830                  | 7.36%

The results are presented in Table 7.7. A first observation is that using model
Poisson GLM3 as an offset reduces the run time of gradient descent fitting because
we start the algorithm already in a reasonable model. Secondly, as expected, the
FN features decrease the loss of model Poisson GLM3, this indicates that there
are systematic effects that are not captured by the GLM. The final combined and
regularized model has roughly the same out-of-sample loss as the corresponding
FN network, showing that this approach can be beneficial in run times, and the
predictive power is similar to a pure FN network. ■


Example 7.23 (Improving Model Poisson GLM3) In this example we would like to 
explore the deficiencies of model Poisson GLM3 by boosting it with FN network 
features. We do this in a systematic way by only considering two (continuous)
feature components at a time in the FN network. That is, we consider the combined
approach (7.40) with initialization (7.42), but as feature information for the network
part, we only consider two components at a time. For instance, we start with the
features $(1, \texttt{Area}, \texttt{VehPower}) \in \{1\} \times \mathbb{R}^2$ for the network part, and the remaining
feature information is ignored in this step. This way we can test whether the 
marginal modeling of Area and VehPower is suitable in model Poisson GLM3, 
and whether a pairwise interaction in these two components is missing. We train 
this FN network starting from model Poisson GLM3 (and keeping this GLM part 
frozen). The decrease in the out-of-sample loss during the gradient descent training 
is shown in Fig. 7.15 (top-left). We observe that the loss remains rather constant over 
100 training epochs. This tells us that the pair (Area, VehPower) is appropriately 
considered in model Poisson GLM3. 

Figure 7.15 gives all pairwise plots of the continuous feature components Area, 
VehPower, VehAge, DrivAge, BonusMalus, Density, the scale on the y- 
axis is identical in all plots. We observe that only the plots including the variable 
BonusMalus provide a bigger decrease in loss (in blue color in the colored 
version). This indicates that mainly this feature component is not modeled optimally 
in model Poisson GLM3, because boosting with a FN network finds systematic 
structure here that improves the loss of model Poisson GLM3. In model Poisson 
GLM3, the variable BonusMalus has been modeled log-linearly with an interaction
term with DrivAge and (DrivAge)$^2$, see (5.35). Table 7.8 shows the result
if we add a FN network feature (7.40) for the pair (DrivAge, BonusMalus)
to model Poisson GLM3. Indeed, we see that the resulting combined GLM-FN
network model has roughly the same GL as the full FN network approach
(out-of-sample losses 23.805 vs. 23.824). Thus, we
conclude that model Poisson GLM3 performs fairly well and only the modeling
of the pair (DrivAge, BonusMalus) should be improved. ■


7.4.4 Network Ensemble Learning 


Ensemble learning is a popular way of expressing that one takes an average over
different predictors. There are many established methods that belong to the family of
ensemble learning, e.g., there is bootstrap aggregating (called bagging) introduced
by Breiman [51], there are random forests, and there is boosting. Random forests
and boosting are mainly based on classification and regression trees (CARTs) and
they belong to the most powerful machine learning methods for tabular data. These
methods combine a family of predictors to a more powerful predictor. The present
section is inspired by the bagging method of Breiman [51], and we perform network
aggregating (called nagging).

Fig. 7.15 Exploring all pairwise interactions: out-of-sample losses over 100 gradient descent
epochs for all pairs of the continuous feature components Area, VehPower, VehAge,
DrivAge, BonusMalus, Density (the scale on the y-axis is identical in all plots).
Panels: Area vs. VehPower, Area vs. VehAge, Area vs. DrivAge, Area vs. BonusMalus,
Area vs. Density, VehPower vs. VehAge, VehPower vs. DrivAge, VehPower vs. BonusMalus,
VehPower vs. Density, VehAge vs. DrivAge, VehAge vs. Density, VehAge vs. BonusMalus,
DrivAge vs. BonusMalus, DrivAge vs. Density, BonusMalus vs. Density


Stochastic Gradient Descent Fitting of Networks 
We have described that network calibration involves several elements of randomness.
This in combination with early stopping leads to the non-uniqueness of
reasonably good networks for prediction and pricing. We have discussed this based
on Fig. 7.5, namely, for a given network architecture we have a continuum of
comparably good models (w.r.t. the chosen objective function) that lie in the green
area of Fig. 7.5. One SGD calibration picks one specific model from this green area,
we also refer to Remarks 7.9. Of course, this is very unsatisfactory in insurance
pricing because it implies that the selection of a price for an insurance policy has
a substantial element of subjectivity (that cannot be explained to the customer).
Naturally, we would like to combine models in the green area of Fig. 7.5, for
instance, by performing some sort of integration over the models in the green area.
Intuitively, this should lead to a very powerful predictive model because it diversifies
the weaknesses of each individual model. This is exactly what we discuss in this
section. Before doing so, we would first like to understand the different single
calibrations of a given network architecture.

Table 7.8 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3
of Table 5.5, model Poisson GLM3 with additional FN features for (DrivAge, BonusMalus),
the FN network model (with embedding layers of dimension b = 2), and the combined regression
model GLM3+FN, see (7.40)

| Model                               | Run time | # param. | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq. |
|-------------------------------------|----------|----------|---------------------|-------------------------|-------------|
| Poisson null                        | -        | 1        | 25.213              | 25.445                  | 7.36%       |
| Poisson GLM3                        | 15s      | 50       | 24.084              | 24.102                  | 7.36%       |
| GLM3 + FN(DrivAge, BonusMalus)      | -        | 50 + 792 | 23.804              | 23.805                  | 7.36%       |
| Embed FN bias regularized           | 124s     | 792      | 23.690              | 23.824                  | 7.36%       |
| Combined GLM+FN bias regularized    | 72s      | 50 + 792 | 23.765              | 23.830                  | 7.36%       |

We consider the MTPL data of Example 7.12. We model this data with a Poisson
FN network using embedding layers for the categorical features and using bias
regularization (7.33) to guarantee the balance property to hold. For the FN network
architecture we choose depth d = 3 with $(q_1, q_2, q_3) = (20, 15, 10)$ FN neurons;
this setup gives us the results on the last line of Table 7.5. We now repeat this
procedure M = 1'600 times, using exactly the same FN network architecture, the
same early stopping strategy, the same SGD method and the same batch size. We
only change the seeds of the starting point $\boldsymbol{\vartheta}^{[0]} \in \mathbb{R}^r$ of the SGD algorithm, the
partitioning of the learning data $\mathcal{L}$ into training data $\mathcal{U}$ and validation data $\mathcal{V}$, see
Fig. 7.7, and the partitioning of the training data into the (mini-)batches.

The resulting 1'600 in-sample and out-of-sample deviance losses are presented
in Fig. 7.16. We observe a considerable variation in these figures. The in-sample
losses vary between 23.616 and 23.815 (mean 23.728), and the corresponding
out-of-sample losses between 23.766 and 23.899 (mean 23.819), units are in $10^{-2}$; note
that all network calibrations are bias regularized. The in-sample loss is an average
over n = 610'206 (individual) unit deviance losses, and the out-of-sample loss an
average over T = 67'801 unit deviance losses, see also Definition 4.24. Therefore,
we expect an even much bigger variation on individual insurance policies. We are
going to analyze this in more detail in this section.

Fig. 7.16 Boxplots over 1'600 network calibrations only differing in the seeds for the SGD
algorithm and the partitioning of the learning data: (lhs) in-sample losses on $\mathcal{L}$ and (rhs)
out-of-sample losses on $\mathcal{T}$, the horizontal lines show the calibration chosen in Table 7.5; units are in
$10^{-2}$


Before doing so, we would like to understand whether there is some dependence
between the in-sample and the out-of-sample losses over the M = 1'600 runs of
the SGD algorithm with different seeds. In Fig. 7.17 we provide a scatter plot of
the out-of-sample losses vs. the in-sample losses. This plot is complemented by
a cubic spline regression (in orange color). From this plot we conclude that the
models with very small in-sample losses tend to over-fit, and the models with large
in-sample losses tend to under-fit (always using the same early stopping rule). In
view of these results we conclude that the chosen early stopping rule is sensible
because on average it tends to provide the model with the smallest out-of-sample
loss on $\mathcal{T}$. Recall that we do not use $\mathcal{T}$ during the SGD fitting, but only the learning
data $\mathcal{L}$ that is split into the training data $\mathcal{U}$ and the validation data $\mathcal{V}$ for exercising
the early stopping, see Fig. 7.7.


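Fig. 7.17 Scatter plot of out-of-sample losses vs. in-sample losses for different seeds, the orange
line gives a fitted cubic spline, and the cyan lines show the empirical means; units are in $10^{-2}$

Next, we study the estimated prices on the test data (out-of-sample)

$$
\mathcal{T} = \left\{ \left(Y_t^\dagger = N_t^\dagger / v_t^\dagger,\ \boldsymbol{x}_t^\dagger,\ v_t^\dagger\right):\ t = 1, \ldots, T = 67'801 \right\}.
$$

For each run of the SGD algorithm we receive a different (early stopped) network
parameter estimate $\widehat{\boldsymbol{\vartheta}}^{m} \in \mathbb{R}^r$, $1 \le m \le M = 1'600$. Using these parameter
estimates we receive the estimated network regression functions, for $1 \le m \le M$,

$$
\boldsymbol{x} \mapsto \widehat{\mu}^{m}(\boldsymbol{x}) = \mu_{\widehat{\boldsymbol{\vartheta}}^{m}}(\boldsymbol{x}),
$$

using the FN network of Listing 7.4 with network parameter $\widehat{\boldsymbol{\vartheta}}^{m}$. Thus, for the
out-of-sample policies $1 \le t \le T$ we receive the expected frequencies

$$
\boldsymbol{x}_t^\dagger \mapsto \widehat{\mu}_t^{m} = \widehat{\mu}^{m}(\boldsymbol{x}_t^\dagger) = \mu_{\widehat{\boldsymbol{\vartheta}}^{m}}(\boldsymbol{x}_t^\dagger).
$$

Since we choose the seeds of the SGD runs at random we may (and will) assume
that we have independence between the prices $(\widehat{\mu}_t^{m})_{1 \le t \le T}$ of the different runs
$1 \le m \le M$ of the SGD algorithm. This allows us to estimate the average price and the
coefficient of variation of these prices of a fixed insurance policy t over the different
SGD runs

$$
\bar{\mu}_t^{(1:M)} = \frac{1}{M} \sum_{m=1}^{M} \widehat{\mu}_t^{m}
\qquad \text{and} \qquad
{\rm Vco}_t = \frac{\sqrt{\frac{1}{M-1} \sum_{m=1}^{M} \left(\widehat{\mu}_t^{m} - \bar{\mu}_t^{(1:M)}\right)^2}}{\bar{\mu}_t^{(1:M)}}.
\tag{7.43}
$$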
These (out-of-sample) coefficients of variation are illustrated in Fig. 7.18. We
observe a considerable variation on some policies. The average coefficient of
variation is roughly 10% (orange horizontal line, lhs). The maximal coefficient of
variation is about 40%, thus, for this policy the individual prices $\widehat{\mu}_t^{m}$ of the different
SGD runs $1 \le m \le M$ fluctuate considerably around $\bar{\mu}_t^{(1:M)}$. This now explains
why we choose M = 1'600 SGD runs, namely, the averaging in (7.43) reduces the
coefficient of variation on this policy to $40\%/\sqrt{M} = 40\%/40 = 1\%$, note that we
have independence between the different SGD runs. Thus, by averaging we receive
an acceptable influence of the variation of the individual SGD fittings.
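The two quantities in (7.43) and the $\sqrt{M}$ reduction can be sketched as follows. This is a minimal illustration and not the code used for the figures: the prices are simulated stand-ins for the network predictions of one policy, and the empirical variance is normalized by $M - 1$, which is an assumption of this sketch.

```python
import math
import random

def mean_and_vco(prices):
    """Average price and coefficient of variation of one policy over M SGD runs."""
    m = len(prices)
    avg = sum(prices) / m
    var = sum((p - avg) ** 2 for p in prices) / (m - 1)
    return avg, math.sqrt(var) / avg

def vco_of_average(vco_single, m):
    """Coefficient of variation of the average of m independent runs."""
    return vco_single / math.sqrt(m)

# Simulated stand-in for the M = 1'600 network prices of one policy (not real data).
rng = random.Random(1)
prices = [0.08 * math.exp(rng.gauss(0.0, 0.3)) for _ in range(1600)]
avg, vco = mean_and_vco(prices)

# The worked number from the text: 40% / sqrt(1600) = 1%.
reduced = vco_of_average(0.40, 1600)
```

The second function relies on the independence of the SGD runs; with dependent runs the reduction would be weaker than $1/\sqrt{M}$.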

Listing 7.9 shows the 10 policies (out-of-sample) with the largest coefficients
of variation ${\rm Vco}_t$. These policies have in common that they belong to the lowest
BonusMalus level, the drivers are very young, the cars are comparably old and
they have a bigger vehicle power. From a practical point of view we should doubt
these policies, since the information provided may not be correct. New drivers (at
the age of 18) typically enter a bonus-malus scheme at level 100, and only after
several accident-free years these drivers can reach a bonus-malus level of 50. Thus,
policies as in Listing 7.9 should not exist, and our pricing framework has difficulties
to (correctly) handle them. In practice, this needs further investigation because,
obviously, there is a data issue here.

Fig. 7.18 Out-of-sample coefficients of variation ${\rm Vco}_t$ on an individual policy level $1 \le t \le T$
over the 1'600 calibrations: (lhs) scatter plot against the average estimated frequencies $\bar{\mu}_t^{(1:M)}$ and
(rhs) resulting histogram


Listing 7.9 The 10 policies (out-of-sample) with the largest coefficients of variation 


Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas  Region vco
D    8        6      18      50         B11      Regular R53    0.4089006
D    9        7      20      50         B11      Regular R24    0.3827665
C    8        L      8       50         B5       Regular R24    0.3762306
S    9        8      18      50         B5       Regular R24    0.3697370
C    7        7      8       50         B1       Regular R24    0.3579979
C    9        9      9       50         B5       Regular R24    0.3554879
ic   6        5      20      50         B1       Regular R93    0.3528679
C    7        4      9       50         B1       Regular R53    0.3518279
A    11       20     50      50         B13      Regular R74    0.3442184
D    5        4      18      50         B3       Diesel  R24    0.3403783

Nagging Predictor 


The previously observed variations of the prices motivate averaging over the
different models (network calibrations). This brings us to bagging introduced by
Breiman [51]. Bagging is based on averaging/aggregating over several 'independent'
predictions; this is done in three steps. In a first step, a model is fitted to the
data $\mathcal{L}$. In a second step, independent bootstrap samples $\mathcal{L}^{*(m)}$ are generated from
this fitted model; the independence has to be understood in a conditional sense,
namely, the different bootstrap samples $\mathcal{L}^{*(m)}$ are independent in m, given the data
$\mathcal{L}$. In the third step, for every bootstrap sample $\mathcal{L}^{*(m)}$ one estimates a model $\widehat{\mu}^{m}$,
and averaging (7.43) provides the bagging predictor. Bagging is mainly a variance
reduction technique. Note that if the fitted model of the first step has a bias, then
likely the bootstrap samples $\mathcal{L}^{*(m)}$ are biased, and so is the bagging predictor.
Therefore, bagging does not help to reduce a potential bias. All these results have to
be understood conditionally on the data $\mathcal{L}$. If this data is atypical for the problem,
so will the bootstrap samples be.

We can perform a similar analysis for the fitted networks, but we do not need to
bootstrap here, because the various elements of randomness in SGD fitting allow us
to generate independent predictors $\widehat{\mu}^{m}$, conditional on the data $\mathcal{L}$. Averaging (7.43)
over these predictors then provides us with the network aggregating (nagging)
predictor $\bar{\mu}^{(1:M)}$; we also refer to Dietterich [105] and Richman-Wüthrich [315]
for this aggregation. Thus, we replace the bootstrap step by the different runs of
the SGD algorithm. Both options provide independent predictors $\widehat{\mu}^{m}$, conditional
on the data $\mathcal{L}$. However, there is a fundamental difference between bagging and
nagging. Bagging generates new (bootstrap) samples $\mathcal{L}^{*(m)}$ and, thus, bagging also
involves randomness coming from sampling the new observations. Nagging always
acts on the same sample $\mathcal{L}$, and it only refits the model multiple times. Therefore,
the latter will typically introduce less variation. Of course, bagging and nagging can
be combined, and then the full expected GL can be estimated, we come back to this
in Sect. 11.4, below. We do not sample new observations here, because we would
like to understand the variations implied by the SGD algorithm with early stopping
on the given (fixed) data.

In Fig. 7.18 we have seen that we need nagging over 1’600 network calibrations 
so that the maximal coefficient of variation on an individual policy level is below 
1% in our MTPL example. In this section we would like to understand the minimal 
out-of-sample loss that can be achieved by nagging on the (entire) test data set, and 
we would like to analyze its rate of convergence. 


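For this we define the sequence of nagging predictors

$$
\bar{\mu}^{(1:M)} = \frac{1}{M} \sum_{m=1}^{M} \widehat{\mu}^{m} \qquad \text{for } M \ge 1.
\tag{7.44}
$$

This allows us to study the out-of-sample losses on $\mathcal{T}$ in the Poisson model for
$M \ge 1$

$$
D\left(\mathcal{T}, \bar{\mu}^{(1:M)}\right) = \frac{2}{T} \sum_{t=1}^{T} v_t^\dagger \left( \bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger) - Y_t^\dagger - Y_t^\dagger \log\left( \frac{\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)}{Y_t^\dagger} \right) \right).
$$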
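The sequence (7.44) and its Poisson deviance losses can be sketched in a few lines. The data below is a tiny hypothetical test set with three hypothetical prediction runs; the function names are chosen for this illustration only, and the term $Y \log(\mu/Y)$ is set to zero for $Y = 0$.

```python
import math

def poisson_deviance(y, mu, v):
    """Exposure-weighted average Poisson deviance loss over all policies."""
    total = 0.0
    for y_t, mu_t, v_t in zip(y, mu, v):
        # The term y * log(mu / y) is zero for y = 0.
        log_term = y_t * math.log(mu_t / y_t) if y_t > 0 else 0.0
        total += 2.0 * v_t * (mu_t - y_t - log_term)
    return total / len(y)

def nagging_losses(y, v, predictions):
    """Losses of the nagging predictors for M = 1, ..., len(predictions).

    predictions[m][t] is the frequency of policy t predicted by run m.
    """
    running = [0.0] * len(y)
    losses = []
    for m, run in enumerate(predictions, start=1):
        running = [s + p for s, p in zip(running, run)]
        mu_bar = [s / m for s in running]  # nagging predictor over the first m runs
        losses.append(poisson_deviance(y, mu_bar, v))
    return losses

# Hypothetical test set: observed frequencies Y_t = N_t / v_t and exposures v_t.
y = [0.0, 2.0, 0.0, 1.0]
v = [1.0, 0.5, 0.8, 1.0]
predictions = [
    [0.05, 0.20, 0.10, 0.30],
    [0.15, 0.10, 0.06, 0.20],
    [0.10, 0.15, 0.08, 0.25],
]
losses = nagging_losses(y, v, predictions)
```

Because the Poisson deviance is convex in the mean, the loss of an averaged predictor never exceeds the average of the individual losses on the same data; this is the sample counterpart of the Jensen argument used below.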


Remark 7.24 From Remarks 7.17 we know that the expected deviance GL of 
the estimated model is lower bounded by the expected deviance GL of the true 
data generating model; the difference is the conditional calibration. Within the 
family of Tweedie’s CP models Richman-Wüthrich [315] proved that, indeed, 
aggregating decreases monotonically the expected deviance GL of the estimated 
model (Proposition 2 of [315]), convergence is established (Proposition 3 of [315]),
and the speed of convergence is provided using asymptotic normality (Proposition
4 of [315]). For the Gaussian square loss results we refer to Breiman [51] and 
Bühlmann-Yu [60]. 


We revisit Proposition 2 of Richman-Wüthrich [315] which has also been proved
in Proposition 3.1 of Denuit-Trufin [103]. We only consider a single case in the next
proposition and we drop the feature information $\boldsymbol{x}$ (because we can condition on
$\boldsymbol{X} = \boldsymbol{x}$).


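Proposition 7.25 Choose a response $Y \sim f(\cdot; \theta, v/\varphi)$ belonging to Tweedie's CP
model having a power variance cumulant function $\kappa = \kappa_p$ with power variance
parameter $p \in [1, 2]$, see (2.17). Assume $\widehat{\mu}$ is an estimator for the mean parameter
$\mu = \kappa_p'(\theta) > 0$ satisfying $\epsilon < \widehat{\mu} < \frac{p}{p-1}\mu$, a.s., for some $\epsilon \in (0, \frac{p}{p-1}\mu)$.
Choose i.i.d. copies $\widehat{\mu}^{m}$, $m \ge 1$, of $\widehat{\mu}$ being all independent of Y. We have for all
$M \ge 1$

$$
\mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \widehat{\mu}^{1}\right)\right]
\ \ge\ \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M)}\right)\right]
\ \ge\ \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
\ \ge\ \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \mu\right)\right].
$$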


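Proof of Proposition 7.25 The lower bound on the right-hand side immediately
follows from Theorem 4.19. For an estimate $\widehat{\mu} > 0$ we define the function, we
also refer to (4.18) and we set for the canonical link $h_p = (\kappa_p')^{-1}$,

$$
\widehat{\mu} \mapsto \psi_p(\widehat{\mu}) = \mu h_p(\widehat{\mu}) - \kappa_p\left(h_p(\widehat{\mu})\right) =
\begin{cases}
\mu \log(\widehat{\mu}) - \widehat{\mu} & \text{for } p = 1, \\[4pt]
\mu \dfrac{\widehat{\mu}^{1-p}}{1-p} - \dfrac{\widehat{\mu}^{2-p}}{2-p} & \text{for } p \in (1, 2), \\[4pt]
-\mu/\widehat{\mu} - \log(\widehat{\mu}) & \text{for } p = 2.
\end{cases}
$$

This is the part of the log-likelihood (and deviance loss) that depends on the
canonical parameter $\widehat{\theta} = h_p(\widehat{\mu})$, and replacing the observation Y by $\mu$. Calculating
the second derivative w.r.t. $\widehat{\mu}$ provides for $p \in [1, 2]$

$$
\frac{\partial^2}{\partial \widehat{\mu}^2} \psi_p(\widehat{\mu})
= -p \mu \widehat{\mu}^{-p-1} - (1-p) \widehat{\mu}^{-p}
= \widehat{\mu}^{-(1+p)} \left[ -p\mu - (1-p)\widehat{\mu} \right] \le 0,
$$

the last inequality uses that the square bracket is non-positive, a.s., under our
assumptions on $\widehat{\mu}$. Thus, $\psi_p$ is concave on the interval $(0, \frac{p}{p-1}\mu)$. We now
focus on the inequalities for $M \ge 1$. Consider the decomposition of the nagging
predictor for M + 1

$$
\bar{\mu}^{(1:M+1)} = \frac{1}{M+1} \sum_{j=1}^{M+1} \bar{\mu}^{(-j)},
\qquad \text{where} \qquad
\bar{\mu}^{(-j)} = \frac{1}{M} \sum_{m=1}^{M+1} \widehat{\mu}^{m} \mathbb{1}_{\{m \ne j\}}.
$$

The predictors $\bar{\mu}^{(-j)}$, $j \ge 1$, are copies of $\bar{\mu}^{(1:M)}$, though not independent ones.
Using the function $\psi_p$, the second term on the right-hand side has the same structure
as the estimation risk function (4.14),

$$
\begin{aligned}
\mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M)}\right)\right]
&= \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
 + 2\, \mathbb{E}_\theta\left[ Y h_p\left(\bar{\mu}^{(1:M+1)}\right) - \kappa_p\left(h_p\left(\bar{\mu}^{(1:M+1)}\right)\right) \right] \\
&\qquad - 2\, \mathbb{E}_\theta\left[ Y h_p\left(\bar{\mu}^{(1:M)}\right) - \kappa_p\left(h_p\left(\bar{\mu}^{(1:M)}\right)\right) \right] \\
&= \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
 + 2 \left( \mathbb{E}\left[ \psi_p\left(\bar{\mu}^{(1:M+1)}\right) \right] - \mathbb{E}\left[ \psi_p\left(\bar{\mu}^{(1:M)}\right) \right] \right) \\
&= \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
 + 2 \left( \mathbb{E}\left[ \psi_p\left( \frac{1}{M+1} \sum_{j=1}^{M+1} \bar{\mu}^{(-j)} \right) \right] - \mathbb{E}\left[ \psi_p\left(\bar{\mu}^{(1:M)}\right) \right] \right) \\
&\ge \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right]
 + 2 \left( \mathbb{E}\left[ \frac{1}{M+1} \sum_{j=1}^{M+1} \psi_p\left(\bar{\mu}^{(-j)}\right) \right] - \mathbb{E}\left[ \psi_p\left(\bar{\mu}^{(1:M)}\right) \right] \right) \\
&= \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M+1)}\right)\right],
\end{aligned}
$$

the second last step applies Jensen's inequality to the concave function $\psi_p$, and the
last step follows from the fact that $\bar{\mu}^{(-j)}$, $j \ge 1$, are copies of $\bar{\mu}^{(1:M)}$. □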


Remarks 7.26 


• Proposition 7.25 says that aggregation works, i.e., aggregating i.i.d. predictors
  leads to monotonically decreasing expected deviance GLs. In fact, if $\widehat{\mu} < 2\mu$,
  a.s., we receive Tweedie's forecast dominance by aggregating, restricted to the
  power variance parameters $p \in [1, 2]$, see Definition 4.22.

• The i.i.d. assumption can be relaxed, indeed, it is sufficient that every $\bar{\mu}^{(-j)}$
  in the above proof has the same distribution as $\bar{\mu}^{(1:M)}$. This does not require
  independence between the predictors $\widehat{\mu}^{m}$, $m \ge 1$, but exchangeability is
  sufficient.

• We need the condition $\epsilon < \widehat{\mu} < \frac{p}{p-1}\mu$, a.s., to ensure the monotonicity
  within Tweedie's CP models. For the Poisson model p = 1 we can drop the
  upper bound, and we only need the lower bound to ensure the existence of the
  expected deviance GL. For $p \in (1, 2]$ the upper bound is increasingly binding,
  in the gamma case p = 2 requiring $\widehat{\mu} < 2\mu$, a.s.

• Note that we do not require unbiasedness of $\widehat{\mu}$ for $\mu$ in Proposition 7.25. Thus,
  at this stage, aggregating is a variance reduction technique.

• If additionally we have unbiasedness of $\widehat{\mu}$ for $\mu$ and a uniformly integrable upper
  bound on $\bar{\mu}^{(1:M)}$, we can use Lebesgue's dominated convergence theorem and the
  law of large numbers to prove

  $$
  \lim_{M \to \infty} \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \bar{\mu}^{(1:M)}\right)\right]
  = \mathbb{E}_\theta\left[\lim_{M \to \infty} \mathfrak{d}\left(Y, \bar{\mu}^{(1:M)}\right)\right]
  = \mathbb{E}_\theta\left[\mathfrak{d}\left(Y, \mu\right)\right].
  \tag{7.45}
  $$

  The uniformly integrable upper bound is only needed in the Poisson case p = 1,
  because the other cases are covered by $\epsilon < \widehat{\mu} < \frac{p}{p-1}\mu$, a.s. Moreover,
  asymptotic normality can be established, we refer to Proposition 4 in
  Richman-Wüthrich [315].

Fig. 7.19 Out-of-sample losses $D(\mathcal{T}, \bar{\mu}^{(1:M)})$ of the nagging predictors
$(\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger))_{1 \le t \le T}$ for $M \ge 1$
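The monotonicity of Proposition 7.25 can be checked exactly in a toy Poisson case (p = 1). The sketch below assumes a hypothetical two-point predictor that equals a or b with probability 1/2 each, independent of Y, so that the nagging predictor over M copies follows a rescaled binomial distribution. It evaluates the excess expected deviance $2(\psi_1(\mu) - \mathbb{E}[\psi_1(\bar{\mu}^{(1:M)})])$ over the true model without simulation; all names and numbers are illustrative.

```python
import math

def psi_poisson(mu_hat, mu):
    """psi_1(mu_hat) = mu * log(mu_hat) - mu_hat, the Poisson case p = 1."""
    return mu * math.log(mu_hat) - mu_hat

def excess_deviance(m, a, b, mu):
    """Expected deviance of the nagging predictor minus that of the true mean.

    Each of the m i.i.d. predictors equals a or b with probability 1/2.
    """
    expected_psi = 0.0
    for k in range(m + 1):
        prob = math.comb(m, k) * 0.5 ** m      # k predictors equal a
        mu_bar = (k * a + (m - k) * b) / m     # resulting nagging predictor
        expected_psi += prob * psi_poisson(mu_bar, mu)
    return 2.0 * (psi_poisson(mu, mu) - expected_psi)

# Hypothetical setting: true mean 0.10 and an unbiased two-point predictor.
mu, a, b = 0.10, 0.05, 0.15
excess = [excess_deviance(m, a, b, mu) for m in range(1, 41)]
```

In this setting the excess is positive and decreasing in M, and because the two-point predictor is unbiased it tends to zero, in line with (7.45).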


We come back to our MTPL Poisson claim frequency example and its 1'600
network calibrations illustrated in Fig. 7.17. Figure 7.19 provides the out-of-sample
portfolio losses $D(\mathcal{T}, \bar{\mu}^{(1:M)})$ of the resulting nagging predictors $(\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger))_{1 \le t \le T}$
for $1 \le M \le 40$ in red color, and the corresponding 1 standard deviation confidence
bounds in orange color. The blue horizontal dotted line shows the case M = 1
which exactly refers to the (first) bias regularized FN network $\widehat{\mu}^{m=1}$ with embedding
layers given in Table 7.5. Indeed, averaging over multiple networks improves the
predictive model and the out-of-sample loss decreases over the first $2 \le M \le 10$
nagging steps. After the first 10 steps the picture starts to stabilize which indicates
that for this size of portfolio (and this type of problem) we need to average over
roughly 10-20 FN networks to receive optimal predictive models on the portfolio
level. For $M \to \infty$ the out-of-sample loss converges to the green horizontal dotted
line in Fig. 7.19 of $23.783 \cdot 10^{-2}$. These numbers are also reported on the last line
of Table 7.9.

Figure 7.20 provides the empirical auto-calibration property (7.39) of the
nagging predictor $\bar{\mu}^{(1:M)}$ for M = 1'600; this is obtained completely analogously to Fig. 7.12.
The nagging predictors are (already) bias regularized, and Fig. 7.20 supports that
the auto-calibration property holds rather accurately.

Table 7.9 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3 of
Table 5.5, the FN network models (with embedding layers of dimension b = 2), and the nagging
predictor for M = 1'600

| Model                                             | Run time | # param. | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq. |
|---------------------------------------------------|----------|----------|---------------------|-------------------------|-------------|
| Poisson null                                      | -        | 1        | 25.213              | 25.445                  | 7.36%       |
| Poisson GLM3                                      | 15s      | 50       | 24.084              | 24.102                  | 7.36%       |
| Embed FN bias regularized $\widehat{\mu}^{m=1}$   | +4s      | 792      | 23.690              | 23.824                  | 7.36%       |
| Average over 1'600 SGDs (Fig. 7.16)               | -        | 792      | 23.728              | 23.819                  | 7.36%       |
| Nagging FN $\bar{\mu}^{(1:M)}$, M = 1'600         | ∞        | 792      | 23.691              | 23.783                  | 7.36%       |

Fig. 7.20 Empirical auto-calibration (7.39) of the Poisson nagging predictor, the blue line shows the
empirical density of $v_i \bar{\mu}^{(1:M)}(\boldsymbol{x}_i)$, $1 \le i \le n$ (x-axis: estimated claims $v \mu(\boldsymbol{x})$)

At this stage, we have fully arrived at Breiman’s [53] two modeling cultures 
dilemma, see also Sect. 1.1. We have started from a parametric data model, and 
in order to boost its predictive performance we have combined such models in 
an algorithmic way. Working with many blended networks is not really practical, 
therefore, in such situations, a meta model can be fitted to the resulting nagging 
predictor. 


Meta Model 
Since working with M = 1'600 different FN networks is not practical, we fit a meta
model to the nagging predictors $\bar{\mu}^{(1:M)}(\cdot)$. This can easily be done by just selecting
an additional FN network and fitting this additional network to the working data

$$
\mathcal{D}^* = \left\{ \left( \bar{\mu}^{(1:M)}(\boldsymbol{x}_i), \boldsymbol{x}_i, v_i \right):\ 1 \le i \le n \right\}
\cup \left\{ \left( \bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger), \boldsymbol{x}_t^\dagger, v_t^\dagger \right):\ 1 \le t \le T \right\}.
$$

Table 7.10 Run times, number of parameters, in-sample and out-of-sample deviance losses (units
are in $10^{-2}$) and in-sample average frequency of the Poisson null model, model Poisson GLM3
of Table 5.5, the FN network model (with embedding layers of dimension b = 2), the nagging
predictor, and the meta network model

| Model                                             | Run time | # param. | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq. |
|---------------------------------------------------|----------|----------|---------------------|-------------------------|-------------|
| Poisson null                                      | -        | 1        | 25.213              | 25.445                  | 7.36%       |
| Poisson GLM3                                      | 15s      | 50       | 24.084              | 24.102                  | 7.36%       |
| Embed FN bias regularized $\widehat{\mu}^{m=1}$   | +4s      | 792      | 23.690              | 23.824                  | 7.36%       |
| Nagging FN $\bar{\mu}^{(1:M)}$                    | ∞        | 792      | 23.691              | 23.783                  | 7.36%       |
| Meta FN network $\widehat{\mu}^{\rm meta}$        | -        | 792      | 23.714              | 23.777                  | 7.36%       |


For this calibration step we can consider all data, since we would like to fit a 
regression model as accurately as possible to the entire regression surface formed by 
all nagging predictors from the learning and the test data sets £ and 7. Moreover, 
this step should not over-fit since this regression surface of nagging predictors 
does not include any noise, but it is on the level of expected values. As network 
architecture we choose again the same FN network of depth d = 3. The only 
change to the fitting procedure above is replacing the Poisson deviance loss by the 
square loss function, since we do not work with the Poisson responses $N_i$ but rather
with their mean estimates $\bar{\mu}^{(1:M)}(\boldsymbol{x}_i)$ and $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$ in this fitting step. Since the
resulting meta network model may still have a bias we apply the bias regularization 
step of Listing 7.7 to the Poisson observations with the Poisson deviance loss on the 
learning data £ (only). The results are presented in Table 7.10. 

From these results we observe that in our case the meta network performs 
similarly well to the nagging predictor, and it seems to be a very reasonable choice. 
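The construction of the working data and the square loss fit can be sketched as follows. This is only a schematic stand-in: instead of a FN network we fit a straight line by closed-form least squares, the three 'calibrations' are hypothetical linear predictors, and the bias regularization step of Listing 7.7 is not included.

```python
def fit_linear_least_squares(xs, targets):
    """Closed-form square loss fit of targets ~ a + b * x (stand-in for the meta network)."""
    n = len(xs)
    x_bar = sum(xs) / n
    t_bar = sum(targets) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxt = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, targets))
    b = sxt / sxx
    return t_bar - b * x_bar, b

def nagging_predictor(x, fitted_runs):
    """Average of the M individual predictors at feature value x."""
    return sum(f(x) for f in fitted_runs) / len(fitted_runs)

# Hypothetical calibrations that differ slightly, as different seeds would.
fitted_runs = [
    lambda x: 0.05 + 0.010 * x,
    lambda x: 0.07 + 0.008 * x,
    lambda x: 0.06 + 0.012 * x,
]
x_learn = [0.0, 1.0, 2.0, 3.0]  # features of the learning data
x_test = [0.5, 2.5]             # features of the test data

# Working data: nagging predictions (not the noisy observations) on both data sets.
x_all = x_learn + x_test
targets = [nagging_predictor(x, fitted_runs) for x in x_all]
a, b = fit_linear_least_squares(x_all, targets)
```

Since the targets are noise-free mean estimates, the surrogate recovers the averaged regression surface (here intercept 0.06 and slope 0.010) up to rounding, which illustrates why over-fitting is not the main concern in this step.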

Finally, in Fig. 7.21 (lhs) we analyze the resulting frequencies on an individual
policy level on the test data set $\mathcal{T}$. We plot the estimated frequencies $\widehat{\mu}^{m=1}(\boldsymbol{x}_t^\dagger)$ of
the first FN network (this corresponds to 'embed FN bias regularized' in Table 7.10
with an out-of-sample loss of 23.824) against the nagging predictor $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$
which averages over M = 1'600 networks. From Fig. 7.21 (lhs) we conclude
that there are quite some differences between these two predictors, this exactly
reflects the variations obtained in Fig. 7.18 (lhs). The nagging predictor removes this
variation by averaging. Figure 7.21 (rhs) compares the nagging predictor $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$
to the one of the meta model $\widehat{\mu}^{\rm meta}(\boldsymbol{x}_t^\dagger)$. This scatter plot shows that the predictors
lie almost perfectly on the diagonal line which suggests that the meta model can be
used as a substitute for the nagging predictor. This completes this claim frequency
modeling example.


Remark 7.27 The meta model concept can also be useful in other situations. For
instance, we can fit a gradient boosting regression model to the observations.
Typically, this is much faster than calculating a nagging predictor (because it directly
focuses on the weaknesses of the existing model). If the gradient boosting model
is based on regression trees, it has the disadvantage that the resulting regression
function is not continuous, and a non-constant extrapolation might be an issue.
In a second step we can fit a meta FN network model to the former regression
model, lifting the boosting model to a smooth network that allows for a non-constant
extrapolation.

Fig. 7.21 Scatter plot of the out-of-sample predictions $\widehat{\mu}^{m=1}(\boldsymbol{x}_t^\dagger)$, $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$ and $\widehat{\mu}^{\rm meta}(\boldsymbol{x}_t^\dagger)$ over
all policies $1 \le t \le T$ on the test data set $\mathcal{T}$: (lhs) $\widehat{\mu}^{m=1}(\boldsymbol{x}_t^\dagger)$ vs. $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$ and (rhs) $\widehat{\mu}^{\rm meta}(\boldsymbol{x}_t^\dagger)$
vs. $\bar{\mu}^{(1:M)}(\boldsymbol{x}_t^\dagger)$; the color scale shows the exposures $v_t^\dagger \in (0, 1]$


Example 7.28 (Gamma Claim Size Modeling) We revisit the gamma claim size 
example of Sect. 5.3.7. The data comprises Swedish motorcycle claim amounts. We 
have seen that this claim size data is not heavy-tailed, thus, a gamma distribution 
may be a reasonable choice for this data. For the modeling of this data we use the
same normalization as in (5.45), this parametrization does not require the explicit
knowledge of the (constant) shape parameter of the gamma distribution for mean
estimation.

The difficulty with this data is that only 656 insurance policies suffer a claim, 
and likely a single FN network will not lead to stable results in this example. 
As FN network architecture we again choose a network of depth d = 3 and 
with $(q_1, q_2, q_3) = (20, 15, 10)$ neurons. Since the input layer has dimension
$q_0 = 1 + 6 = 7$ we receive a network parameter of dimension r = 626. As loss
function we choose the gamma deviance loss, see Table 4.1. Moreover, we choose 
the nadam optimizer, a batch size of 300, a training-validation split of 8:2, and we 
retrieve the network calibration with the lowest validation loss with a callback. 
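The dimension r = 626 of the network parameter can be verified by counting weights and intercepts layer by layer. The helper below is an illustrative sketch (its name is not from the text); it assumes 6 feature components, one intercept per neuron, and a single output neuron.

```python
def fn_parameter_count(n_features, hidden):
    """Number of weights and intercepts of a FN network with one output neuron."""
    sizes = [n_features] + list(hidden)
    # Each neuron of a hidden layer has one weight per input plus one intercept.
    count = sum((q_in + 1) * q_out for q_in, q_out in zip(sizes[:-1], sizes[1:]))
    # Output neuron: one weight per neuron of the last hidden layer plus one intercept.
    return count + (sizes[-1] + 1)

# (6 + 1) * 20 + (20 + 1) * 15 + (15 + 1) * 10 + (10 + 1) = 140 + 315 + 160 + 11
r = fn_parameter_count(6, (20, 15, 10))
```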

Figure 7.22 shows the results of 1'000 different SGD runs (only differing in the
initial seeds and the splits of the training-validation sets as well as the batches).
We see a considerable variation between the different SGD runs, both in in-sample
deviance losses but also in the average estimated claims. Note that we did not
bias-regularize the resulting networks (we work with the log-link here which is not the
canonical one). This is why we receive fluctuating portfolio averages in Fig. 7.22
(rhs), the red line illustrates the empirical mean. Obviously, these FN networks are
(on average) positively biased, and they will need a bias correction for the final
prediction.

Fig. 7.22 Boxplots over 1'000 network calibrations only differing in the seeds for the SGD
algorithm and the partitioning of the learning-validation data: (lhs) in-sample losses on the (entire)
data $\mathcal{L}$ and (rhs) average estimated claims

Fig. 7.23 Coefficients of variation ${\rm Vco}_i$ on an individual claim level $1 \le i \le n$ over the 1'000
calibrations: (lhs) scatter plot against the nagging predictor $\bar{\mu}^{(1:M)}(\boldsymbol{x}_i)$ and (rhs) histogram

Figure 7.23 analyzes the variations on an individual claim level by studying
the in-sample version of the coefficient of variation given in (7.43). We see that
these coefficients of variation are bigger than in the claim frequency example, see
Fig. 7.18. Thus, to receive stable results the nagging predictors $\bar{\mu}^{(1:M)}(\boldsymbol{x}_i)$ have to be
calculated over many networks. Figure 7.24 confirms that aggregating reduces
(in-sample) losses also in this case. From this figure we also see that the convergence is
slower compared to the MTPL frequency example of Fig. 7.19, of course, because
we have a much smaller claims portfolio.

Fig. 7.24 In-sample losses $D(\mathcal{L}, \bar{\mu}^{(1:M)})$ of the nagging predictors $(\bar{\mu}^{(1:M)}(\boldsymbol{x}_i))_{1 \le i \le n}$
for $1 \le M \le 40$ on the motorcycle claim size data

Table 7.11 Number of parameters, Pearson's dispersion estimate, MLE dispersion estimate,
in-sample losses and in-sample average claim amounts of the null model (gamma intercept model),
the gamma GLMs and the network nagging predictor; for the GLMs we refer to Table 5.13

| Model                                  | # param. | Dispersion $\widehat{\varphi}^{\rm P}$ | Dispersion $\widehat{\varphi}^{\rm MLE}$ | In-sample loss on $\mathcal{L}$ | Average amount |
|----------------------------------------|----------|-------|-------|-------|--------|
| Gamma null                             | 1 + 1    | 2.057 | 1.690 | 2.085 | 24'641 |
| Gamma GLM1                             | 9 + 1    | 1.537 | 1.426 | 1.717 | 25'105 |
| Gamma GLM2                             | 7 + 1    | 1.544 | 1.427 | 1.719 | 25'130 |
| Gamma FN network nagging               | 626 + 1  | -     | -     | 1.478 | 26'387 |
| Gamma FN network nagging (bias reg)    | 626 + 1  | 1.050 | 1.240 | 1.465 | 24'641 |


Table 7.11 presents the results if we take the nagging predictor over 1’000 
different networks. The first observation is that we receive a much smaller in-sample 
loss compared to the GLMs, thus, there seems to be much room for improvements in 
the GLMs. Secondly, the nagging predictor has a substantial bias. For this reason we 
shift the intercept parameter in the output layer so that the portfolio average of the 
nagging predictor is equal to the empirical mean, see the last column of Table 7.11. 

A main difficulty in this model is the estimation of the dispersion parameter
$\varphi > 0$ and the shape parameter $\alpha = 1/\varphi$ of the gamma distribution, respectively.
Pearson's dispersion estimate does not work because we do not know the degrees
of freedom of the nagging predictor, see also (5.49). In Table 7.11 we calculate
Pearson's dispersion estimate by simply dividing by the number of observations;
this should be understood as a lower bound, and this number is highlighted in italic.
Alternatively, we can calculate the MLE; however, this may be rather different from
Pearson's estimate, as indicated in Table 7.11. Figure 7.25 (lhs) shows the resulting
QQ plot of the nagging predictor if we use the MLE $\hat{\varphi}^{\rm MLE} = 1.240$, and the right-hand
side shows the same plot for $\hat{\varphi}^{\rm P} = 1.050$. From these plots it seems that we
should rather go for a smaller dispersion parameter, the MLE being probably too
much dominated by the small claims. This observation should also be understood as
a red flag, as it tells us that the chosen gamma model is not fully suitable. This may
Fig. 7.25 QQ plots of the nagging predictors against the gamma density with (lhs) $\hat{\varphi}^{\rm MLE} = 1.240$ and (rhs) $\hat{\varphi}^{\rm P} = 1.050$

Fig. 7.26 (lhs) Scatter plot of model Gamma GLM2 predictors against the nagging predictors $\bar{\mu}^{(M)}(x_i)$ over all instances $1 \le i \le n$, (rhs) scatter plot of two (independent) nagging predictors


be for various reasons: (1) the dispersion is not constant and should be modeled 
policy dependent, (2) the features are not sufficient to explain the observations, 
or (3) the gamma distribution is not suitable and should be replaced by another 
distribution. 
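
The role of the divisor in Pearson's dispersion estimate can be illustrated with a minimal sketch. It assumes unit weights and the gamma variance function $V(\mu) = \mu^2$; the function name `pearson_dispersion_gamma`, the argument `n_params` and the toy numbers are hypothetical and not taken from the text.

```python
def pearson_dispersion_gamma(y, mu, n_params=0):
    """Pearson-type dispersion estimate for a gamma model with V(mu) = mu^2.

    With n_params = 0 the sum of squared Pearson residuals is divided by the
    number of observations n (the lower bound discussed in the text); with
    n_params = r it is divided by n - r, as one would do for a GLM.
    Unit weights are assumed (an assumption of this sketch).
    """
    if len(y) != len(mu):
        raise ValueError("y and mu must have the same length")
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    squared_residuals = sum(((yi - mi) / mi) ** 2 for yi, mi in zip(y, mu))
    return squared_residuals / dof


# Hypothetical toy data: observed claim sizes and fitted means.
claims = [1200.0, 800.0, 1500.0, 400.0, 2100.0]
fitted = [1000.0, 1000.0, 1250.0, 500.0, 2000.0]

phi_lower = pearson_dispersion_gamma(claims, fitted)               # divide by n
phi_glm = pearson_dispersion_gamma(claims, fitted, n_params=2)     # divide by n - r
```

Because the unknown effective number of parameters of the nagging predictor is at least zero, dividing by $n$ can only understate the estimate relative to dividing by $n - r$, which is why the value in Table 7.11 is read as a lower bound.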

In Fig. 7.26 (lhs) we compare the predictions received from model Gamma
GLM2 against the nagging predictors $\bar{\mu}^{(M)}(x_i)$ over all instances $1 \le i \le n$.
The scatter plot spreads quite wildly around the diagonal, which seriously questions
at least one of the two models. To ensure that this variability between the two models
is not caused by the (complex) FN network architecture, we verify the nagging
Fig. 7.27 Empirical auto-calibration (7.39) of the gamma predictor
predictor $\bar{\mu}^{(M)}$, $M = 1'000$, by computing a second independent one. Indeed,
Fig. 7.26 (rhs) shows that these two independent nagging predictors come to the same
conclusion on the individual instance level. Thus, the network finds/uses systematic 
effects that are not present in model Gamma GLM2. If we perform a pairwise 
interaction analysis for boosting the GLM as in Example 7.23, we find that we 
should add interactions to the GLM between (VehAge, RiskClass), (VehAge, 
BonusClass), (OwnerAge, Area), and (OwnerAge, VehAge); recall that 
model Gamma GLM2 neither includes BonusClass nor Gender as supported 
by a drop1 backward elimination analysis from model Gamma GLM1. However, 
it turns out, here, that we should have BonusClass in the model by letting it 
interact with VehAge. 

Finally, Fig. 7.27 shows the empirical auto-calibration behavior (7.39) of the 
Gamma FN network nagging predictor of Table 7.11. The resulting black dots are 
rather volatile which shows that we do not (fully) have the auto-calibration property, 
here, but it also expresses that we fit a model on only 656 claims. The prediction
of these claims is highlighted by the blue empirical density given by $\bar{\mu}^{(M)}(x_i)$,
$1 \le i \le n$. On the positive side, the auto-calibration plot shows that we neither
systematically under- nor over-estimate because the black dots fluctuate around the
diagonal red line; only the upper tail seems to under-estimate the true claim size. ■


Ensembling over Selected Networks vs. All Networks 


Zhou et al. [406] ask the question whether ensembling over 'selected' networks is
better than ensembling over all networks. In their proposal they introduce a weighted
averaging scheme over the different network predictors $\hat{\mu}^{m}$, $1 \le m \le M$. We
perform a slightly different analysis here. We re-use the $M = 1'600$ SGD
calibrations of the Poisson FN network illustrated in Fig. 7.17. We order these SGD
calibrations w.r.t. their in-sample losses $D(\mathcal{L}, \hat{\mu}^{m})$, $1 \le m \le M$, and partition this
ordered sample into three equally sized sets: the first one containing the smallest
Fig. 7.28 Empirical density of the in-sample losses of Fig. 7.17
in-sample losses, the second one the middle sized in-sample losses, and the third
one the largest in-sample losses. Figure 7.28 shows the empirical density of these
in-sample losses, and the vertical lines give the partition into the three sets; we call
the resulting (disjoint) index sets $\mathcal{I}^{\rm small}, \mathcal{I}^{\rm middle}, \mathcal{I}^{\rm large} \subset \{1, \ldots, M\}$. Remark that
this partition is done fully in-sample, based on the learning data $\mathcal{L}$ only.

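We then consider the nagging predictors on each of these index sets separately,
i.e.,

$$
\begin{aligned}
\bar{\mu}^{\rm small}(x) &= \frac{1}{|\mathcal{I}^{\rm small}|} \sum_{m \in \mathcal{I}^{\rm small}} \hat{\mu}^{m}(x), \\
\bar{\mu}^{\rm middle}(x) &= \frac{1}{|\mathcal{I}^{\rm middle}|} \sum_{m \in \mathcal{I}^{\rm middle}} \hat{\mu}^{m}(x), \\
\bar{\mu}^{\rm large}(x) &= \frac{1}{|\mathcal{I}^{\rm large}|} \sum_{m \in \mathcal{I}^{\rm large}} \hat{\mu}^{m}(x).
\end{aligned}
\tag{7.46}
$$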
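
The partition by in-sample loss and the subset averages (7.46) can be sketched as follows. This is only an illustration under stated assumptions: the function names `partition_by_loss` and `nagging`, the toy losses and the toy predictions are hypothetical, and when $M$ is not divisible by three the remainder is put into the last set, which is a choice of this sketch and not taken from the text.

```python
def partition_by_loss(losses):
    """Split calibration indices into three sets by increasing in-sample loss.

    Returns (small, middle, large). If the number of calibrations is not
    divisible by three, the remainder goes to the last set (assumption).
    """
    order = sorted(range(len(losses)), key=lambda m: losses[m])
    k = len(order) // 3
    return order[:k], order[k:2 * k], order[2 * k:]


def nagging(predictions, index_set):
    """Average the predictors of the selected calibrations, policy by policy.

    predictions[m][i] is the prediction of calibration m for policy i.
    """
    n_policies = len(predictions[0])
    return [
        sum(predictions[m][i] for m in index_set) / len(index_set)
        for i in range(n_policies)
    ]


# Hypothetical toy example: six calibrations, two policies.
losses = [0.30, 0.10, 0.20, 0.60, 0.50, 0.40]
predictions = [
    [0.10, 0.20],
    [0.12, 0.18],
    [0.08, 0.22],
    [0.11, 0.21],
    [0.09, 0.19],
    [0.10, 0.24],
]

small, middle, large = partition_by_loss(losses)
mu_middle = nagging(predictions, middle)
```

Note that the partition uses the in-sample losses only, so the out-of-sample data is not touched when selecting the calibrations.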


If we believe in the orange cubic spline in Fig. 7.17, the middle nagging predictor
$\bar{\mu}^{\rm middle}$ should out-perform the other two nagging predictors. Indeed, this is the case
here. We receive the out-of-sample losses (in $10^{-2}$) on the three subsets

$$
D(\mathcal{T}, \bar{\mu}^{\rm small}) = 23.784, \qquad D(\mathcal{T}, \bar{\mu}^{\rm middle}) = 23.272, \qquad D(\mathcal{T}, \bar{\mu}^{\rm large}) = 23.782.
\tag{7.47}
$$


This approach outperforms by far any other approach considered, see Table 7.10; note that
this analysis relies on a fully proper in-sample and out-of-sample testing strategy.
Moreover, this also supports our early stopping strategy because, obviously, the
optimal networks are centered around our early stopping rule. How does this result
match Proposition 7.25, which states that the nagging predictor has a monotonically
Fig. 7.29 Scatter plot of the nagging predictors; panel title: claims frequency prediction (log-scale)
decreasing deviance loss? For the convergence (7.45) we need unbiasedness,
and (7.47) indicates that averaging over all $M$ network calibrations results in biases
on an individual policy level; on the aggregate portfolio level, we have applied the
bias regularization step (7.33), but this does not act on an individual policy level.
The latter would require a local balance correction similar to the GAM approach
presented in Example 7.19.

Figure 7.29 is truly striking! It compares the nagging predictors $\bar{\mu}^{(M)}(x_i^\dagger)$
to the ones $\bar{\mu}^{\rm middle}(x_i^\dagger)$ only using the calibrations $m \in \mathcal{I}^{\rm middle}$, i.e., only using
the calibrations with middle sized in-sample losses. The different colors show the
exposures $v_i^\dagger \in (0, 1]$. We observe that only policies with short exposures do not
lie on the diagonal line. Thus, there seems to be an issue with insurance policies
with short exposures. Recall that we model the Poisson claim counts $N_i$ using the
assumption, see (5.27),

$$
N_i \sim {\rm Poi}(v_i \mu(x_i)).
\tag{7.48}
$$

That is, the expected claim count $\mathbb{E}_{\theta_i}[N_i] = v_i \mu(x_i)$ is assumed to scale
proportionally in the exposure $v_i > 0$. Figure 7.29 raises some doubts whether this
is really the case, or at least SGD fitting has some difficulties assessing the expected
frequencies $\mu(x_i)$ on the policies $i$ with short exposures $v_i > 0$. We discuss this
further in the next subsection. Table 7.12 gives a summary of our results.


Analysis of Over-dispersion 


Despite all the excitement about Fig. 7.29, the above models do not fit the observations
since the over-dispersion is too large, see the last column of Table 7.12. This has
motivated the study of the negative binomial model in Sect. 5.3.5, the ZIP model in
Sect. 5.3.6, and the hurdle Poisson model in Example 6.19. These models have led
to an improvement in terms of AIC, see Table 6.6. We could go down the same
Table 7.12 Number of parameters, in-sample and out-of-sample deviance losses (units are in $10^{-2}$), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network model (with embedding layers of dimension $b = 2$), the nagging predictor, the meta network model, and the middle nagging predictor

| Model                                         | # param. | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq. | Disp. $\hat{\varphi}^{\rm P}$ |
| Poisson null                                  | 1        | 25.213 | 25.445 | 7.36% | 1.7160 |
| Poisson GLM3                                  | 50       | 24.084 | 24.102 | 7.36% | 1.6644 |
| Embed FN bias regularized $\hat{\mu}^{m=1}$   | 792      | 23.690 | 23.824 | 7.36% | 1.6812 |
| Nagging FN $\bar{\mu}^{(M)}$                  | '792'    | 23.691 | 23.783 | 7.36% | 1.6592 |
| Meta FN network $\hat{\mu}^{\rm meta}$        | 792      | 23.714 | 23.777 | 7.36% | 1.6737 |
| Middle nagging FN $\bar{\mu}^{\rm middle}$    | '792'    | 23.698 | 23.272 | 7.36% | 1.6618 |


route here by substituting the Poisson model. We refrain from doing so, as we 
want to further analyze the Poisson model. Suppose we calculate an AIC value for 
the Poisson FN network using 792 as the number of parameters involved. In that 
case, we receive a value of 191'790, thus, clearly lower than the one of the negative 
binomial GLM, and also slightly lower than the one of the hurdle Poisson model, 
see Table 6.6. Remark that AIC values within FN networks are not supported by 
any theory as we neither use the MLE nor do we have a reasonable evaluation of the 
number of parameters involved in networks. Thus, such a value may serve at best as 
a rough rule of thumb. 

This lower AIC value suggests that we should try to improve the modeling of 
the systematic effects by better regression functions. In particular, there may be 
more explanatory variables involved that have predictive power. If these explanatory 
variables are latent, we can rely on the negative binomial model, as it can be 
interpreted as a mixture model averaging over latent variables. In view of Fig. 7.29,
the exposures $v_i$ seem to have a predictive power different from proportional scaling,
see (7.48); we also mention some peculiarities of the exposures on page 556. This
motivates changing the FN network regression model such that the exposures are
considered non-proportionally. We choose a FN network that directly models the
mean of the claim counts

$$
(x, v) \in \mathcal{X} \times (0, 1] \;\mapsto\; \mu(x, v) = \exp \left\langle \beta, z^{(d:1)}(x, v) \right\rangle > 0,
\tag{7.49}
$$

modeling the mean $\mathbb{E}_\theta[N] = \mu(x, v)$ of the Poisson datum $(N, x, v)$. The expected
frequency is then given by $\mathbb{E}_\theta[Y] = \mathbb{E}_\theta[N/v] = \mu(x, v)/v$.


Remark 7.29 At this stage we clearly have to distinguish between statistical 
modeling and actuarial modeling. In statistical modeling it makes perfect sense 
to choose the regression function (7.49), since including the exposure in a non- 
proportional way may increase the predictive power of the model, at least this is 
what our data suggests. 

From an actuarial point of view this approach should clearly be doubted. The
typical exposure of car insurance policies is one calendar year, i.e., v = 1, if the 
renewals of insurance policies are accounted correctly. Shorter exposures may have 
a specific (non-predictable) reason, for example, the policyholder or the insurance 
company may terminate an insurance contract after a claim. Thus, if this is possible, 
the exposure is a random variable, too, and it clearly has a predictive power for 
claims prediction; in that case we lose the properties of the Poisson count process 
(having independent and stationary increments). 

As a consequence, we should include the exposure proportionally from an
actuarial modeling point of view. Nevertheless, we do the modeling exercise based
on the regression function (7.49) here. This will indicate the predictive power of the
exposure, which may be thought of as a proxy for another (non-available) explanatory
variable. Moreover, if (7.49) allows for a good Poisson regression model, we have a
simple way of bootstrapping from our data (conditionally on given exposures $v$).

We would also like to emphasize that if one feature component dominates all 
others in terms of the predictive power, then likely there is a leakage of information 
through this component, and this needs a more careful analysis. 


We implement the FN network regression model (7.49) using again a network
architecture of depth $d = 3$ with $(q_1, q_2, q_3) = (20, 15, 10)$ neurons. We use
embedding layers for the two categorical variables VehBrand and Region, and
we have 8 continuous/binary feature components. This is one more compared to
Fig. 7.9 (rhs) because we also model the exposure $v_i$ as a continuous input to the
network. As a result, the dimension $r$ of the network parameter $\vartheta \in \mathbb{R}^r$ increases
from 792 to 812 (because we have $q_1 = 20$ neurons in the first FN layer). We
calculate the nagging predictor $\bar{\mu}^{(M)}$ of this network averaging over $M = 500$
individual (early stopped) FN network calibrations; the results are presented in
Table 7.13.


Table 7.13 Number of parameters, in-sample and out-of-sample deviance losses (units are in $10^{-2}$), in-sample average frequency and (over-)dispersion of the Poisson null model, model Poisson GLM3 of Table 5.5, the FN network models (with embedding layers of dimension $b = 2$), the nagging predictors, and the middle nagging predictors excluding and including exposures $v_i$ as continuous network inputs

| Model                                                       | # param. | In-sample loss on $\mathcal{L}$ | Out-of-sample loss on $\mathcal{T}$ | Aver. freq. | Disp. $\hat{\varphi}^{\rm P}$ |
| Poisson null                                                | 1        | 25.213 | 25.445 | 7.36% | 1.7160 |
| Poisson GLM3                                                | 50       | 24.084 | 24.102 | 7.36% | 1.6644 |
| Embed FN $\hat{\mu}^{m=1}$                                  | 792      | 23.690 | 23.824 | 7.36% | 1.6812 |
| Nagging FN $\bar{\mu}^{(M)}$                                | '792'    | 23.691 | 23.783 | 7.36% | 1.6592 |
| Middle nagging FN $\bar{\mu}^{\rm middle}$                  | '792'    | 23.698 | 23.272 | 7.36% | 1.6618 |
| Exposure $v$: FN $\hat{\mu}^{m=1}$                          | 812      | 23.358 | 23.496 | 7.36% | 1.0650 |
| Exposure $v$: nagging FN $\bar{\mu}^{(M)}$                  | '812'    | 23.299 | 23.382 | 7.36% | 1.0416 |
| Exposure $v$: middle nagging FN $\bar{\mu}^{\rm middle}$    | '812'    | 23.303 | 23.299 | 7.36% | 1.0427 |

Fig. 7.30 Average frequency as a function of the exposure $v \in (0, 1]$: nagging predictors considering the exposures proportionally (blue), the model including exposures non-proportionally through the FN network (black) and observed (red)

We observe a major improvement when including the exposure $v$ as an input
to the network, i.e., by including the exposure non-proportionally into the mean
estimate. This is true in-sample (we use early stopping here), and in terms of
Pearson's dispersion estimate; we set $r = 812$ for the number of parameters in
Pearson's dispersion estimate (5.30), which may be too big because we do not
perform proper MLE here. In particular, we receive a dispersion estimate close
to one which, now, is in support of modeling the claim counts by Poisson random
variables (using this regression function). That is, this regression function explains
the systematic effects so that we no longer observe much over-dispersion in the data
relative to the chosen model. However, we would like to recall Remark 7.29,
which requires careful consideration before using this regression model in insurance
practice.

This is also supported by Fig. 7.30 which studies the average frequency as a
function of the exposure $v \in (0, 1]$. The red observed average frequency has a
clear decreasing slope which can be modeled by running the exposure v through the 
FN network (black), but not by including it proportionally (blue). From an actuarial 
modeling point of view this plot clearly questions the quality of the data, because 
there seem to be effects in the exposures that certainly require more investigation. 
Unfortunately, we cannot do this here because we do not have additional insight into 
this data set. This closes the example. 


7.4.5  Identifiability in Feed-Forward Neural Networks 


In the previous section we have studied ensembles of FN networks. One may also
aim at directly comparing these networks to each other in terms of the fitted network
parameters $\hat{\vartheta}^{j}$ over the different calibrations $1 \le j \le M$ (of the same FN network
architecture). Such a comparison may, e.g., be useful if one wants to choose a
prior parameter distribution $\pi$ for $\vartheta$ in a Bayesian setting. Comparing the different
network calibrations $\hat{\vartheta}^{j}$, $1 \le j \le M$, of an architecture needs some care because
networks have many symmetries that make the parameters non-identifiable. We
can, for instance, permute the neurons in a FN layer $z^{(m)}$, with the corresponding
permutation of the weights that connect this layer to the previous layer $z^{(m-1)}$ and to
the succeeding layer $z^{(m+1)}$. The resulting predictive model under this permutation
is the same as the original one. For this reason we need to introduce some order in a
FN network to make the parameters identifiable.

Rüger–Ossen [323] have introduced the notion of a fundamental domain for the
network parameter $\vartheta$, and we briefly review this idea. We start with an explicit
example. Assume that the activation function fulfills the anti-symmetry property
$-\phi(x) = \phi(-x)$ for all $x \in \mathbb{R}$; this is the case for the hyperbolic tangent. This
implies several symmetries in the FN network parametrization. E.g., if we consider
the output of a shallow FN network $d = 1$ with link function $g$, we can do a sign
switch in a fixed neuron $1 \le k \le q_1$


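$$
\begin{aligned}
g(\mu(x)) = \left\langle \beta, z^{(1)}(x) \right\rangle
&= \beta_0 + \sum_{j=1}^{q_1} \beta_j \, \phi \left\langle w_j^{(1)}, x \right\rangle \\
&= \beta_0 + \sum_{j \neq k} \beta_j \, \phi \left\langle w_j^{(1)}, x \right\rangle + (-\beta_k) \, \phi \left\langle -w_k^{(1)}, x \right\rangle.
\end{aligned}
\tag{7.50}
$$

From this we see that the following two network parameters (we switch signs in all
the parameters that belong to index $k$)

$$
\vartheta = \left( w_1^{(1)}, \ldots, w_k^{(1)}, \ldots, w_{q_1}^{(1)}, \beta_0, \ldots, \beta_k, \ldots, \beta_{q_1} \right)^\top
\quad \text{and} \quad
\tilde{\vartheta} = \left( w_1^{(1)}, \ldots, -w_k^{(1)}, \ldots, w_{q_1}^{(1)}, \beta_0, \ldots, -\beta_k, \ldots, \beta_{q_1} \right)^\top
$$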

give the same FN network predictions. Besides these sign switches, we can also
permute the enumeration of the neurons in a given FN layer, giving the same
predictions. We discuss Theorem 2 of Rüger–Ossen [323] to solve this identifiability
issue. First, we consider the network weights from the input $x$ to the first FN layer
$z^{(1)}(x)$. Apply the sign switch operation (7.50) to the neurons in the first FN layer
so that all the resulting intercepts $w_{0,1}^{(1)}, \ldots, w_{0,q_1}^{(1)}$ are positive while not changing
the regression function $x \mapsto g(\mu(x))$. Next, apply a permutation to the indices
$1 \le j \le q_1$ so that we receive ordered intercepts

$$
w_{0,1}^{(1)} > \ldots > w_{0,q_1}^{(1)} > 0,
$$

with an unchanged regression function $x \mapsto g(\mu(x))$. To make these transformations
well-defined we need to assume that all intercepts are non-zero and mutually
different (which we assume for the time-being).
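
The two symmetries and the resulting ordering can be checked numerically in a minimal sketch for a shallow network ($d = 1$) with hyperbolic tangent activation. The function names `shallow_net` and `to_fundamental_domain` and all numerical values are hypothetical; each weight vector is stored as a list whose first entry is the intercept, which is a convention of this sketch. As in the text, all intercepts are assumed to be non-zero and mutually different.

```python
import math


def shallow_net(x, W, beta):
    """Linear predictor beta[0] + sum_j beta[j] * tanh(w_j[0] + <w_j[1:], x>)."""
    out = beta[0]
    for w, b in zip(W, beta[1:]):
        pre_activation = w[0] + sum(wl * xl for wl, xl in zip(w[1:], x))
        out += b * math.tanh(pre_activation)
    return out


def to_fundamental_domain(W, beta):
    """Sign-switch neurons with negative intercept, then sort by decreasing intercept."""
    neurons = []
    for w, b in zip(W, beta[1:]):
        if w[0] < 0:                       # sign switch as in (7.50)
            w, b = [-wl for wl in w], -b
        neurons.append((list(w), b))
    neurons.sort(key=lambda nb: nb[0][0], reverse=True)   # permutation
    return [w for w, _ in neurons], [beta[0]] + [b for _, b in neurons]


# Hypothetical parameters: three hidden neurons, two input features.
W = [[-0.5, 1.0, 2.0], [0.3, -1.0, 0.5], [0.8, 0.2, -0.4]]
beta = [0.1, 0.7, -1.2, 0.4]
x = [0.25, -1.5]

W_c, beta_c = to_fundamental_domain(W, beta)

# A permuted and sign-switched copy of the same network.
W_alt = [W[2], [-wl for wl in W[1]], W[0]]
beta_alt = [beta[0], beta[3], -beta[2], beta[1]]
W_alt_c, beta_alt_c = to_fundamental_domain(W_alt, beta_alt)
```

Both parametrizations give the same prediction and are mapped to the same representative, which is the property one needs before comparing fitted parameters across calibrations.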

Then, we move recursively through the FN layers $2 \le m \le d$ applying the sign
switch operations and the permutations so that the regression function $x \mapsto g(\mu(x))$
remains unchanged and such that for all $1 \le m \le d$

$$
w_{0,1}^{(m)} > \ldots > w_{0,q_m}^{(m)} > 0.
$$

This provides us with a unique representation of every network parameter $\vartheta \in \mathbb{R}^r$
in the fundamental domain

$$
\left\{ \vartheta \in \mathbb{R}^r ;\; w_{0,1}^{(m)} > \ldots > w_{0,q_m}^{(m)} > 0 \text{ for all } 1 \le m \le d \right\} \subset \mathbb{R}^r,
\tag{7.51}
$$

provided that all intercepts are different from zero and mutually different in the
same FN layers. As stated in Section 2.2 of Rüger–Ossen [323], there may still exist
different parameters in this fundamental domain that provide the same predictive
model, but these are of zero Lebesgue measure. The same applies to the intercepts
$w_{0,j}^{(m)}$ being zero or having equal intercepts for different neurons. Basically, this
means that we are fine if we work with absolutely continuous prior distributions
on the fundamental domain when we want to work within a Bayesian setup.


7.5  Auto-encoders 


Auto-encoders are tools that aim at reducing the dimension of high-dimensional 
data such that the reconstruction error of the original data is small, i.e., such that 
the loss of information by the dimension reduction is minimized. The most popular 
auto-encoder is the principal components analysis (PCA) which we are going to 
present here. The PCA is a linear dimension reduction technique. Bottleneck neural 
(BN) networks can be viewed as a non-linear extension of the PCA. This is going 
to be discussed in Sect. 7.5.5, below. Dimension reduction techniques belong to the 
family of unsupervised learning methods because they do not consider a response 
variable, but they aim at finding common structure in the features. Unsupervised 
learning methods can roughly be categorized into three classes: dimension reduction 
techniques (studied in this section), clustering methods and visualization methods. 
For a discussion of clustering and visualization methods we refer to the tutorial of
Rentzmann–Wüthrich [310].


7.5.1 Standardization of the Data Matrix


Assume we have $q$-dimensional data points $y_i \in \mathbb{R}^q$, $1 \le i \le n$. This provides us
with a data matrix

$$
Y = (y_1, \ldots, y_n)^\top =
\begin{pmatrix}
y_{1,1} & \cdots & y_{1,q} \\
\vdots & \ddots & \vdots \\
y_{n,1} & \cdots & y_{n,q}
\end{pmatrix}
\in \mathbb{R}^{n \times q}.
$$

We assume that each of the $q$ columns of $Y$ measures a quantity in a given unit.
The first column may, for instance, describe the age of a car driver in years, the
second column his body weight in kilograms, etc. That is, each column $1 \le j \le q$
of $Y$ describes a specific quantity, and each row $y_i^\top$ of $Y$ describes these quantities
for a given instance $1 \le i \le n$. Since often the analysis should not depend on
the units of the columns of $Y$, one centers the columns with the empirical means
$\bar{y}_j = \sum_{i=1}^n y_{i,j} / n$, and one normalizes them with the empirical standard deviations
$\hat{\sigma}_j = \left( \sum_{i=1}^n (y_{i,j} - \bar{y}_j)^2 / n \right)^{1/2}$, $1 \le j \le q$. This gives the normalized data matrix

$$
\begin{pmatrix}
\frac{y_{1,1} - \bar{y}_1}{\hat{\sigma}_1} & \cdots & \frac{y_{1,q} - \bar{y}_q}{\hat{\sigma}_q} \\
\vdots & \ddots & \vdots \\
\frac{y_{n,1} - \bar{y}_1}{\hat{\sigma}_1} & \cdots & \frac{y_{n,q} - \bar{y}_q}{\hat{\sigma}_q}
\end{pmatrix}
\in \mathbb{R}^{n \times q}.
\tag{7.52}
$$

We typically center the data matrix $Y$, providing $\sum_{i=1}^n y_{i,j} = 0$ for all $1 \le j \le q$;
normalization w.r.t. the standard deviation can be done, but is not always necessary.
Centering implies that we can interpret $Y$ as a $q$-dimensional empirical distribution
with each component (column) being centered. The covariance matrix of this
(centered) empirical distribution is calculated as

$$
\hat{\Sigma} = \left( \frac{1}{n} \sum_{i=1}^n y_{i,j} \, y_{i,k} \right)_{1 \le j,k \le q} = \frac{1}{n} Y^\top Y \in \mathbb{R}^{q \times q}.
\tag{7.53}
$$

This is a covariance matrix, and if the columns of $Y$ are normalized with the
empirical standard deviations $\hat{\sigma}_j$, $1 \le j \le q$, this is a correlation matrix.
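
The centering and normalization (7.52) and the empirical covariance (7.53) can be sketched with NumPy as follows. The function names `standardize` and `empirical_covariance` and the toy data are hypothetical; the standard deviation uses the divisor $n$ (NumPy's default `ddof=0`), in line with the definition of $\hat{\sigma}_j$ above.

```python
import numpy as np


def standardize(Y, normalize=True):
    """Center the columns of Y and, optionally, scale them to unit standard deviation."""
    Y = np.asarray(Y, dtype=float)
    centered = Y - Y.mean(axis=0)
    if normalize:
        centered = centered / Y.std(axis=0)   # ddof=0, i.e. divisor n
    return centered


def empirical_covariance(Y_centered):
    """Empirical covariance (1/n) Y^T Y of a column-centered data matrix."""
    n = Y_centered.shape[0]
    return Y_centered.T @ Y_centered / n


# Hypothetical toy data: age in years and body weight in kilograms.
Y = np.array([[30.0, 70.0], [40.0, 80.0], [50.0, 60.0], [60.0, 90.0]])

Z = standardize(Y)                   # centered and normalized columns
Sigma = empirical_covariance(Z)      # a correlation matrix in this case
```

With `normalize=False` the same function returns the centered matrix only, and `empirical_covariance` then gives the covariance rather than the correlation matrix.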


7.5.2 Introduction to Auto-encoders 


An auto-encoder encodes a high-dimensional vector $y \in \mathbb{R}^q$ to a low-dimensional
representation so that the dimension reduction leads to a minimal loss of information.
A function $L(\cdot, \cdot): \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+$ is called dissimilarity function if
$L(y, y') = 0$ if and only if $y = y'$.

An auto-encoder is a pair $(\Phi, \Psi)$ of mappings, for given dimensions $p < q$,

$$
\Phi: \mathbb{R}^q \to \mathbb{R}^p \qquad \text{and} \qquad \Psi: \mathbb{R}^p \to \mathbb{R}^q,
\tag{7.54}
$$

such that their composition $\Psi \circ \Phi$ has a small reconstruction error w.r.t. the chosen
dissimilarity function $L(\cdot, \cdot)$, that is,

$$
y \mapsto L(y, \Psi \circ \Phi(y)) \quad \text{is small for all cases } y \text{ of interest.}
\tag{7.55}
$$

Note that we want (7.55) for selected cases $y$, and if they are within a $p$-dimensional
manifold the auto-encoding will be successful. The first mapping $\Phi: \mathbb{R}^q \to \mathbb{R}^p$ is
called encoder, and the second mapping $\Psi: \mathbb{R}^p \to \mathbb{R}^q$ is called decoder. The object
$\Phi(y) \in \mathbb{R}^p$ is a $p$-dimensional encoding (representation) of $y \in \mathbb{R}^q$ which contains
maximal information of $y$ up to the reconstruction error (7.55).


7.5.3 Principal Components Analysis 


PCA gives us a linear auto-encoder (7.54). If the data matrix $Y \in \mathbb{R}^{n \times q}$ has rank
$q$, there exist $q$ linearly independent rows of $Y$ that span $\mathbb{R}^q$. PCA determines a
different, very specific basis of $\mathbb{R}^q$. It looks for an orthonormal basis $v_1, \ldots, v_q \in \mathbb{R}^q$
such that $v_1$ explains the direction of the biggest variability in $Y$, $v_2$ the direction
of the second biggest variability in $Y$ orthogonal to $v_1$, and so forth. Variability is
understood in the sense of maximal empirical variance under the assumption that
the columns of $Y$ are centered, see (7.52)–(7.53). Such an orthonormal basis can
be found by determining $q$ linearly independent eigenvectors of the symmetric and
positive definite matrix

$$
A = n \hat{\Sigma} = Y^\top Y \in \mathbb{R}^{q \times q}.
$$

For this we can solve recursively the following convex Lagrange problems. The first
basis vector $v_1 \in \mathbb{R}^q$ is determined by the solution of³

$$
v_1 = \underset{\|w\|_2 = 1}{\arg\max} \; \|Y w\|_2^2 = \underset{w^\top w = 1}{\arg\max} \left( w^\top Y^\top Y w \right),
\tag{7.56}
$$

and the $j$-th basis vector $v_j \in \mathbb{R}^q$, $2 \le j \le q$, is received recursively by the solution
of

$$
v_j = \underset{\|w\|_2 = 1}{\arg\max} \; \|Y w\|_2^2 \qquad \text{subject to } \langle v_k, w \rangle = 0 \text{ for all } 1 \le k \le j - 1.
\tag{7.57}
$$

³ If the $q$ eigenvalues of $A$ are distinct, the solution to (7.56) and (7.57) is unique up to the sign,
otherwise this requires more care.

Singular value decomposition (SVD) gives an alternative way of computing this
orthonormal basis; we refer to Section 14.5.1 in Hastie et al. [183]. The algorithm
of Golub–Van Loan [165] gives an efficient way of performing a SVD. There exist
orthogonal matrices $U \in \mathbb{R}^{n \times q}$ and $V \in \mathbb{R}^{q \times q}$ (with $U^\top U = V^\top V = \mathbb{1}_q$), and
a diagonal matrix $\Lambda = {\rm diag}(\lambda_1, \ldots, \lambda_q) \in \mathbb{R}^{q \times q}$ with singular values
$\lambda_1 \ge \ldots \ge \lambda_q \ge 0$ such that we have the SVD

$$
Y = U \Lambda V^\top.
\tag{7.58}
$$

The matrix $U$ is called left-singular matrix of $Y$, and the matrix $V$ is called
right-singular matrix of $Y$. Observe by using the SVD (7.58)

$$
V^\top A V = V^\top Y^\top Y V = V^\top V \Lambda U^\top U \Lambda V^\top V = \Lambda^2 = {\rm diag}(\lambda_1^2, \ldots, \lambda_q^2).
$$

That is, the squared singular values $(\lambda_j^2)_{1 \le j \le q}$ are the eigenvalues of matrix $A$, and
the column vectors of the right-singular matrix $V = (v_1, \ldots, v_q)$ (eigenvectors of
$A$) give an orthonormal basis $v_1, \ldots, v_q$. This motivates defining the $q$ principal
components of $Y$ by the column vectors of

$$
Y V = U \Lambda = U \, {\rm diag}(\lambda_1, \ldots, \lambda_q) = (\lambda_1 u_1, \ldots, \lambda_q u_q) \in \mathbb{R}^{n \times q}.
\tag{7.59}
$$

E.g., the first principal component of the instances $1 \le i \le n$ is given by
$Y v_1 = \lambda_1 u_1 \in \mathbb{R}^n$. Considering the first $p \le q$ principal components gives the rank $p$
matrix

$$
Y_p = U \, {\rm diag}(\lambda_1, \ldots, \lambda_p, 0, \ldots, 0) V^\top \in \mathbb{R}^{n \times q}.
\tag{7.60}
$$

The Eckart–Young–Mirsky theorem [114, 279]⁴ proves that this rank $p$ matrix $Y_p$
minimizes the Frobenius norm relative to $Y$ among all rank $p$ matrices, that is,

$$
Y_p \in \underset{B \in \mathbb{R}^{n \times q}}{\arg\min} \; \|Y - B\|_F \qquad \text{subject to } {\rm rank}(B) \le p,
\tag{7.61}
$$

where the Frobenius norm is given by $\|C\|_F^2 = \sum_{i,j} c_{i,j}^2$ for a matrix $C = (c_{i,j})_{i,j}$.
The orthonormal basis $v_1, \ldots, v_q \in \mathbb{R}^q$ gives the (linear) encoder (projection)

$$
\Phi: \mathbb{R}^q \to \mathbb{R}^p, \qquad y \mapsto \Phi(y) = \left( \langle y, v_1 \rangle, \ldots, \langle y, v_p \rangle \right)^\top = (v_1, \ldots, v_p)^\top y.
$$

⁴ In fact, (7.61) holds for both the Frobenius norm and the spectral norm.
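
This gives the first $p$ principal components in (7.59) if we insert the transposed
data matrix $Y^\top = (y_1, \ldots, y_n) \in \mathbb{R}^{q \times n}$ for $y \in \mathbb{R}^q$. The (linear) decoder $\Psi$ is
given by

$$
\Psi: \mathbb{R}^p \to \mathbb{R}^q, \qquad z \mapsto \Psi(z) = (v_1, \ldots, v_p) z.
$$

The following is understood column-wise for the transposed data matrix $Y^\top$,

$$
\begin{aligned}
\Psi \circ \Phi(Y^\top) &= (v_1, \ldots, v_p)(v_1, \ldots, v_p)^\top Y^\top \\
&= \left( Y (v_1, \ldots, v_p)(v_1, \ldots, v_p)^\top \right)^\top \\
&= \left( Y (v_1, \ldots, v_p, 0, \ldots, 0)(v_1, \ldots, v_p, v_{p+1}, \ldots, v_q)^\top \right)^\top \\
&= \left( U \, {\rm diag}(\lambda_1, \ldots, \lambda_p, 0, \ldots, 0) V^\top \right)^\top = Y_p^\top.
\end{aligned}
$$

Thus, $\Psi \circ \Phi(Y^\top)$ minimizes the Frobenius reconstruction error (7.61) on the data
matrix $Y^\top$ among all linear maps of rank $p$. In view of (7.55) we can express the
squared Frobenius reconstruction error as

$$
\|Y - Y_p\|_F^2 = \sum_{i=1}^n \left\| y_i - \Psi \circ \Phi(y_i) \right\|_2^2 = \sum_{i=1}^n L\left( y_i, \Psi \circ \Phi(y_i) \right),
\tag{7.62}
$$

thus, we choose the squared Euclidean distance as the dissimilarity measure here,
which we minimize simultaneously on all cases $y_i$, $1 \le i \le n$.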
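
The linear encoder and decoder of the PCA can be sketched with NumPy's SVD. This is an illustration under stated assumptions: the function name `pca_autoencoder` and the simulated data are hypothetical, cases are stored as rows (so the encoder is applied as `y @ V_p` rather than column-wise), and the data matrix is assumed to be column-centered.

```python
import numpy as np


def pca_autoencoder(Y, p):
    """Return (encode, decode, singular_values) for a column-centered matrix Y."""
    _, lam, Vt = np.linalg.svd(Y, full_matrices=False)
    V_p = Vt[:p].T                     # q x p matrix (v_1, ..., v_p)

    def encode(y):
        return y @ V_p                 # rows of y are cases; returns the p scores

    def decode(z):
        return z @ V_p.T               # maps the scores back to R^q

    return encode, decode, lam


# Hypothetical correlated data, centered column by column.
rng = np.random.default_rng(seed=1)
raw = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Y = raw - raw.mean(axis=0)

p = 2
encode, decode, lam = pca_autoencoder(Y, p)
Y_p = decode(encode(Y))                      # rank p reconstruction as in (7.60)
squared_error = np.sum((Y - Y_p) ** 2)       # squared Frobenius error (7.62)
```

In this sketch the squared reconstruction error coincides with the sum of the squared singular values that were dropped, $\sum_{j > p} \lambda_j^2$, which follows directly from (7.58) and (7.60).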


Remark 7.30 The PCA gives a linear approximation to the data matrix $Y$ by
minimizing (7.61) and (7.62) for given rank $p$. This may not be appropriate if the
non-linear terms are dominant. Figure 7.31 (lhs) gives a situation where the PCA
works well; this data has been generated by i.i.d. multivariate Gaussian random
vectors $y_i \sim \mathcal{N}(0, \Sigma)$. Figure 7.31 (middle) gives a non-linear example where the
PCA does not work well; the data matrix $Y \in \mathbb{R}^{n \times 2}$ is a column-centered matrix
that builds a circle around the origin.

Another nice example where the PCA fails is Fig. 7.31 (rhs). This figure is
inspired by Shlens [337] and Ruckstuhl [321]. It shows a situation where the level
sets are non-convex, and the principal components point into a completely wrong
direction to explain the structure of the data.

Fig. 7.31 Two-dimensional PCAs in different situations of the data matrix $Y \in \mathbb{R}^{n \times 2}$


7.5.4 Lab: Lee—Carter Mortality Model 


We use the SVD to fit the most popular stochastic mortality model, the Lee–Carter
(LC) model [238], to (raw) mortality data. The raw mortality data considers for each
calendar year $t$ and each age $x$ the number of people $D_{x,t}$ who died (in that year $t$
at age $x$) divided by the corresponding population exposure $e_{x,t}$. In practice this
requires some care. Due to migration, often, the exposures $e_{x,t}$ are non-observable
figures and need to be estimated. Moreover, also the death counts $D_{x,t}$ in year $t$ at
age $x$ can be defined differently; age cohorts are usually defined by the year of birth.
We denote the (observed) raw mortality rates by $M_{x,t} = D_{x,t} / e_{x,t}$. The subsequent
derivations consider the raw log-mortality rates $\log(M_{x,t})$; for this reason we assume
that $M_{x,t} > 0$ for all calendar years $t$ and ages $x$. The goal is to model these raw
log-mortality rates (for each country, region, risk group and gender separately).
The LC model defines the force of mortality as

$$
\log(\mu_{x,t}) = a_x + b_x k_t,
\tag{7.63}
$$

where $\log(\mu_{x,t})$ is the (deterministic) log-mortality rate in calendar year $t$ for a
person aged $x$ (for a fixed country, region and gender). The individual terms in (7.63)
have the following meaning: $a_x$ is the average force of mortality at age $x$, $b_x$ is the
rate of change of the force of mortality broken down to the different ages $x$, and $k_t$
is the time index describing the change of the force of mortality in calendar year $t$.

Strictly speaking, we do not have a stochastic model here that can explain the
observations $M_{x,t}$, but we try to fit a deterministic mortality surface $(\mu_{x,t})_{x,t}$ to
these noisy observations $(M_{x,t})_{x,t}$. For this we use the PCA and the Frobenius norm
as the measure of dissimilarity (on the log-scale).

In a first step, we center the raw log-mortality rates for all ages x, i.e., over the 
calendar years t € 7 under consideration. We define the centered raw log-mortality 
rates yx, and the estimate @, of the average force of mortality at age x as follows 


1 


Yor = log(M,. 1) -a = log(M,,1) — ITI 


Y > log(My,s), (7.64) 
seT 
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where the last identity defines the estimate ax. Strictly speaking we have a slight 
difference to the centering in Sect.7.5.1 because we center the rows and not the 
columns of the data matrix, here, but the role of rows and columns is exchangeable in 
the PCA. The optimal (parameter) values (by) x and (kr) tį are determined as follows, 
see (7.63), 


: 2 
arg min Y; t — brki) , 
(by )x, (k)i 2 arik) 


where the sum runs over the years t € 7 and the ages xọ < x < x1, with xo and x; 
being the lower and upper age boundaries. This can be rewritten as an optimization 
problem (7.61)-(7.62). Consider the data matrix Y = (Yx t)xosx<xiter € R4, 
and set n = xı — x9 + 1 and q = |T|. Assume Y has rank q. This allows us to 
consider 


Yı € argmin || Y — B||f subject to rank(B) < 1. 
BeR"*4 


A solution to this problem is given, see (7.60), 
Yı = Udiag(41,0,...,0)V' = (Aimi)v] = (Yv,)v] e RY, 


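with left-singular matrix $U = (u_1, \ldots, u_q) \in \mathbb{R}^{n \times q}$ and right-singular matrix
$V = (v_1, \ldots, v_q) \in \mathbb{R}^{q \times q}$ of $Y$. This implies that the first principal component
$\lambda_1 u_1 = Y v_1 \in \mathbb{R}^n$ gives an estimate for $(b_x)_{x_0 \le x \le x_1}$, and the first column vector $v_1 \in \mathbb{R}^q$
of $V$ gives an estimate for the time index $(k_t)_{t \in \mathcal{T}}$. For parameter identifiability we
normalize

$$\sum_{x = x_0}^{x_1} \widehat{b}_x = 1 \qquad \text{and} \qquad \sum_{t \in \mathcal{T}} \widehat{k}_t = 0, \tag{7.65}$$

the latter being consistent with the centering of the rows of $Y$ with $\widehat{a}_x$ in (7.64).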
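The estimation steps (7.64)-(7.65) can be sketched in base R as follows. This is only a minimal illustration on synthetic data: the toy ages, years and the simulated matrix `logM` are assumptions made for the sketch and are not the Swiss HMD data used in the text.

```r
set.seed(1)
ages  <- 0:9                         # toy ages x_0, ..., x_1 (assumption)
years <- 1950:1969                   # toy calendar years T (assumption)
n <- length(ages); q <- length(years)

# synthetic raw log-mortality rates log(M_{x,t}); NOT real mortality data
a_true <- seq(-6, -2, length.out = n)
b_true <- rep(1 / n, n)
k_true <- seq(5, -5, length.out = q)
logM <- outer(b_true, k_true) + a_true + matrix(rnorm(n * q, sd = 0.05), n, q)

a_hat <- rowMeans(logM)              # estimate of a_x, see (7.64)
Y     <- logM - a_hat                # centered rows: Y[x, t] = log(M[x, t]) - a_hat[x]

sv    <- svd(Y)                      # Y = U diag(lambda) V'
b_raw <- sv$d[1] * sv$u[, 1]         # first principal component lambda_1 u_1 = Y v_1
k_raw <- sv$v[, 1]                   # first right-singular vector v_1

# normalization (7.65): rescale so that sum(b) = 1; the product b_x k_t is unchanged
b_hat <- b_raw / sum(b_raw)
k_hat <- k_raw * sum(b_raw)

Y1     <- outer(b_hat, k_hat)        # rank-1 approximation of Y
log_mu <- a_hat + Y1                 # fitted log-mortality surface
```

Note that the sketch only rescales to obtain the first constraint in (7.65); the second one, `sum(k_hat) = 0`, holds up to numerical error because the rows of `Y` have been centered.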

We fit the LC model to the Swiss mortality data of females and males separately.
The raw log-mortality rates $\log(M_{x,t})$ for the years $t \in \mathcal{T} = \{1950, \ldots, 2016\}$
and the ages $0 \le x \le 99$ are illustrated in Fig. 7.32; both plots use the same color
scale. This mortality data has been obtained from the Human Mortality Database
(HMD) [195]. In general, we observe a diagonal structure that indicates mortality
improvements over time.

Fig. 7.32 Raw log-mortality rates $\log(M_{x,t})$ for the calendar years $1950 \le t \le 2016$ and the ages
$x_0 = 0 \le x \le x_1 = 99$ of Swiss females (lhs) and Swiss males (rhs); both plots use the same color
scale

Fig. 7.33 LC fitted log-mortality rates $\log(\widehat{\mu}_{x,t})$ for the calendar years $1950 \le t \le 2016$ and the
ages $x_0 = 0 \le x \le x_1 = 99$ of Swiss females (lhs) and Swiss males (rhs); the plots use the same
color scale as Fig. 7.32


Define the fitted log-mortality surface

$$\log(\widehat{\mu}_{x,t}) = \widehat{a}_x + \widehat{b}_x \widehat{k}_t \qquad \text{for } x_0 \le x \le x_1 \text{ and } t \in \mathcal{T}.$$

Figure 7.33 shows the LC fitted log-mortality surface $(\log(\widehat{\mu}_{x,t}))_{0 \le x \le 99;\, t \in \mathcal{T}}$ separately
for Swiss females and Swiss males; the color scale is the same as in Fig. 7.32.
The plots show a strong similarity between the raw log-mortality data and the LC
fitted log-mortality surface, which clearly supports the LC model for the Swiss
data. In general, the LC surface is a smoothed version of the raw log-mortality
surface. The main difference in our LC fit concerns the male population for ages
$20 \le x \le 40$ from 1980 to 2000; one explanation of the special pattern in the
observed data during that time is the emergence of HIV.

Fig. 7.34 (lhs) Singular values $\lambda_j$, $1 \le j \le |\mathcal{T}|$, of the SVD of the data matrix $Y \in \mathbb{R}^{n \times |\mathcal{T}|}$, and
(rhs) the reconstruction errors $\|Y - Y_p\|_F^2$ for $0 \le p \le |\mathcal{T}|$

Figure 7.34 (lhs) shows the singular values $\lambda_1 \ge \ldots \ge \lambda_{|\mathcal{T}|} \ge 0$ for
Swiss females and Swiss males. We observe that the first singular value $\lambda_1$ by
far dominates the remaining singular values $\lambda_j$, $j \ge 2$. Thus, the first principal
component indeed may already be sufficient, and the centered raw log-mortality
data $Y$ can be described by a matrix $Y_1$ of rank $p = 1$. Figure 7.34 (rhs) gives
the squared Frobenius reconstruction errors of the approximations $Y_p$ of ranks
$0 \le p \le |\mathcal{T}|$, where $Y_0$ corresponds to the zero matrix where we do not use any
approximation, but use just the average observed log-mortality rate. We observe that
the first singular value leads by far to the biggest decrease in the reconstruction error,
and the subsequent expansions $\lambda_j$, $j \ge 2$, improve it only slightly in each step. This
supports the use of the LC model using a rank $p = 1$ approximation to the centered
raw log-mortality rates $Y$. The higher rank PCA within mortality modeling has
been studied in Renshaw–Haberman (RH) [308], and the RH($p$) mortality model
considers the rank $p$ approximation $Y_p$ to the raw log-mortality rates $Y$ given by

$$\log(\mu_{x,t}) = a_x + \langle b_x, k_t \rangle,$$

for $b_x, k_t \in \mathbb{R}^p$.
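The behavior of the reconstruction errors in Fig. 7.34 (rhs) can be reproduced on toy data. The following base R sketch assumes a random matrix `Y` (not the mortality data) and checks the standard SVD property that the squared Frobenius error of the rank $p$ approximation equals the sum of the squared remaining singular values; the helper `rank_p()` is a hypothetical name introduced for this illustration.

```r
set.seed(2)
n <- 12; q <- 6
Y <- matrix(rnorm(n * q), n, q)      # toy data matrix (assumption)
Y <- Y - rowMeans(Y)                 # center the rows as in (7.64)

sv     <- svd(Y)
lambda <- sv$d                       # singular values in decreasing order

# rank-p approximation Y_p = U_p diag(lambda_1, ..., lambda_p) V_p'
rank_p <- function(p) {
  if (p == 0) return(matrix(0, n, q))           # Y_0 is the zero matrix
  sv$u[, 1:p, drop = FALSE] %*% diag(lambda[1:p], p, p) %*%
    t(sv$v[, 1:p, drop = FALSE])
}

# squared Frobenius reconstruction errors, computed in two ways
err_direct <- sapply(0:q, function(p) sum((Y - rank_p(p))^2))
err_sv     <- sapply(0:q, function(p) if (p < q) sum(lambda[(p + 1):q]^2) else 0)
```

Both vectors agree up to numerical error, so the decrease of the error from rank $p-1$ to rank $p$ is exactly $\lambda_p^2$.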

We have (only) fitted a mortality surface to the raw log-mortality rates on the
rectangle $\{x_0, \ldots, x_1\} \times \mathcal{T}$. This does not allow us to forecast mortality into the
future. Forecasting requires a two-step procedure which, after this first estimation
step, extrapolates the time index (time-series) $(\widehat{k}_t)_{t \in \mathcal{T}}$ beyond the latest observation
point in $\mathcal{T}$. The simplest (meaningful) model for this second (extrapolation) step
is a random walk with drift for the time index process $(\widehat{k}_t)_{t \ge 0}$. Figure 7.35 shows
the estimated two-dimensional process $(\widehat{k}_t)_{t \in \mathcal{T}}$, i.e., for $p = 2$, on the rectangle
$\{x_0, \ldots, x_1\} \times \mathcal{T}$, which needs to be extrapolated to predict within the RH($p = 2$)
mortality model. We refrain from doing this step, but extrapolation will be studied
in Sect. 8.4, below.

Fig. 7.35 Estimated two-dimensional processes $(\widehat{k}_t)_{t \in \mathcal{T}}$ for Swiss females (lhs) and Swiss males
(rhs); these are normalized such that they are centered and such that the components of $\widehat{b}_x$ add up
to 1
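To illustrate what a random walk with drift extrapolation of a one-dimensional time index amounts to, here is a minimal base R sketch. The series `k_hat` is simulated (an assumption, not the fitted Swiss index), the forecast horizon is an arbitrary choice, and the forecast standard deviation ignores the estimation uncertainty of the drift.

```r
set.seed(3)
years <- 1950:2016
# toy time index with a downward trend (assumption; not a fitted LC index)
k_hat <- cumsum(c(20, rnorm(length(years) - 1, mean = -0.6, sd = 0.5)))
k_hat <- k_hat - mean(k_hat)         # centered as in (7.65)

incr  <- diff(k_hat)                 # increments k_t - k_{t-1}
drift <- mean(incr)                  # drift estimate
sigma <- sd(incr)                    # volatility of the increments

h    <- 1:10                         # forecast horizons in years (arbitrary choice)
k_fc <- k_hat[length(k_hat)] + drift * h   # point forecasts beyond the last year
sd_h <- sigma * sqrt(h)                    # std. dev. of the accumulated innovations
```

The drift estimate only depends on the first and the last value of the series, because the increments telescope.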


7.5.5 Bottleneck Neural Network 


BN networks have become popular in studying non-linear generalizations of PCA;
we refer to Kramer [225] and Hinton–Salakhutdinov [186]. The BN network
architecture is such that (1) the input dimension $q_0$ is equal to the output dimension
$q_{d+1}$ of a FN network, and (2) in between there is a FN layer $1 \le m \le d$ that has a
very low dimension $q_m < q_0$, called the bottleneck. Figure 7.36 (lhs) shows such a
BN network of depth $d = 3$ and neurons

$$(q_0, q_1, q_2, q_3, q_4) = (20, 7, 2, 7, 20).$$

The input and output neurons have blue color, and the bottleneck of dimension $q_2 = 2$
is shown in red color in Fig. 7.36 (lhs).

Fig. 7.36 (lhs) BN network of depth $d = 3$ with $(q_0, q_1, q_2, q_3, q_4) = (20, 7, 2, 7, 20)$, (middle
and rhs) shallow BN networks with a bottleneck of dimensions 7 and 2, respectively


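The motivation is as follows. Assume we have a given dissimilarity function
$L(\cdot, \cdot): \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}_+$ that measures the reconstruction error of an auto-encoder
$\Psi \circ \Phi(y) \in \mathbb{R}^q$ relative to the original input $y \in \mathbb{R}^q$, see (7.55). We try to find a BN
network with input and output dimensions $q_0 = q_{d+1} = q$ (we drop the intercepts in
the entire construction) and a bottleneck in layer $m$ having a low dimension $q_m$, such
that the BN network provides a small reconstruction error. Choose a FN network

$$y \in \mathbb{R}^q \ \mapsto\ \Psi \circ \Phi(y) = z^{(d+1:1)}(y) = \left(z^{(d+1)} \circ z^{(d)} \circ \cdots \circ z^{(1)}\right)(y) \in \mathbb{R}^q,$$

with FN layers for $1 \le m \le d$ (excluding intercepts)

$$z^{(m)}: \mathbb{R}^{q_{m-1}} \to \mathbb{R}^{q_m}, \qquad z \mapsto z^{(m)}(z) = \left(\phi\langle w_1^{(m)}, z \rangle, \ldots, \phi\langle w_{q_m}^{(m)}, z \rangle\right)^\top,$$

and having network weights $w_j^{(m)} \in \mathbb{R}^{q_{m-1}}$, $1 \le j \le q_m$. For the output we choose
the identity function as activation function

$$z^{(d+1)}: \mathbb{R}^{q_d} \to \mathbb{R}^{q_{d+1}}, \qquad z \mapsto z^{(d+1)}(z) = \left(\langle w_1^{(d+1)}, z \rangle, \ldots, \langle w_{q_{d+1}}^{(d+1)}, z \rangle\right)^\top,$$

and having network weights $w_j^{(d+1)} \in \mathbb{R}^{q_d}$, $1 \le j \le q_{d+1}$. The resulting network
parameter $\vartheta$ is now fitted to the data matrix $Y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^{n \times q}$ such that
the reconstruction error is minimized over all instances

$$\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^r}{\arg\min}\ \sum_{i=1}^n L\left(y_i, \Psi \circ \Phi(y_i)\right) = \underset{\vartheta \in \mathbb{R}^r}{\arg\min}\ \sum_{i=1}^n L\left(y_i, z^{(d+1:1)}(y_i)\right).$$

We use this fitted network parameter $\widehat{\vartheta}$ and denote the resulting FN layers by $\widehat{z}^{(m)}$
for $1 \le m \le d + 1$.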
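This allows us to define the BN encoder, set $q = q_0$ and $p = q_m$,

$$\Phi: \mathbb{R}^{q_0} \to \mathbb{R}^{q_m}, \qquad y \mapsto \Phi(y) = \widehat{z}^{(m:1)}(y) = \left(\widehat{z}^{(m)} \circ \cdots \circ \widehat{z}^{(1)}\right)(y), \tag{7.66}$$

and the BN decoder is given by, set $q_m = p$ and $q_{d+1} = q$,

$$\Psi: \mathbb{R}^{q_m} \to \mathbb{R}^{q_{d+1}}, \qquad z \mapsto \Psi(z) = \widehat{z}^{(d+1:m+1)}(z) = \left(\widehat{z}^{(d+1)} \circ \cdots \circ \widehat{z}^{(m+1)}\right)(z).$$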


The BN encoder (7.66) gives us a $q_m$-dimensional representation of the data. A
linear rank $p$ representation $Y_p$ of $Y$, see (7.61), can be found by a BN network
architecture that has a minimal FN layer width of dimension $p = \min_{1 \le j \le d} q_j$, and
with the identity activation function $\phi(x) = x$. Such a BN network is a linear map
of maximal rank $p$. Using the squared Euclidean distance as dissimilarity measure
provides us with an optimal network parameter $\widehat{\vartheta}$ for this linear map such that we receive
$Y_p = \widehat{z}^{(d+1:1)}(Y)$. There is one point to be considered here, namely, why the bottleneck
activations $\Phi(y) = \widehat{z}^{(m:1)}(y) \in \mathbb{R}^p$ in the linear activation case are not directly
comparable to the principal components $(y^\top v_1, \ldots, y^\top v_p)^\top$ of the PCA. The reason is that
the PCA uses an orthonormal basis $v_1, \ldots, v_p$, whereas the linear BN network case
uses any $p$-dimensional basis, i.e., to directly bring these two representations in line
we still need a coordinate transformation of the bottleneck activations.
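The need for a coordinate transformation can be made concrete with a small base R sketch. It does not fit a network; instead it assumes a linear encoder/decoder pair that spans the same subspace as the first $p$ right-singular vectors, expressed in a different basis via a hypothetical invertible matrix `G`. The data matrix is simulated.

```r
set.seed(4)
n <- 50; q <- 5; p <- 2
Y   <- matrix(rnorm(n * q), n, q)    # toy data matrix (assumption)
V_p <- svd(Y)$v[, 1:p]               # orthonormal PCA basis v_1, ..., v_p

G     <- matrix(c(2, 1, -1, 3), p, p)  # any invertible p x p matrix (hypothetical)
W_enc <- V_p %*% G                     # encoder weights of a linear bottleneck map
W_dec <- solve(G) %*% t(V_p)           # matching decoder weights

Z_pca <- Y %*% V_p                   # principal components
Z_bn  <- Y %*% W_enc                 # linear bottleneck activations

rec_pca <- Z_pca %*% t(V_p)          # rank-p reconstruction from the PCA
rec_bn  <- Z_bn %*% W_dec            # reconstruction from the linear bottleneck map
```

Both maps give the same rank $p$ reconstruction, but `Z_bn` and `Z_pca` differ; they are brought in line by the coordinate transformation `Z_bn %*% solve(G)`.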

Hinton—Salakhutdinov [186] noticed that the gradient descent fitting of a BN 
network needs some care, otherwise we may find a local minimum of the loss 
function that has a poor reconstruction performance. In order to implement a more 
sophisticated way of SGD fitting we require that the depth d of the network is an 
odd number and that the network architecture is symmetric around the central FN 
layer (d + 1)/2. This is the case in Fig. 7.36 (lhs). Fitting of this network of depth 
d = 3 is now done in three steps: 


1. The symmetry around the central FN layer $m = 2$ allows us to collapse this
central layer by merging layers 1 and 3 (because $q_1 = q_3$). Merging these two
layers provides us with a shallow BN network with neurons
$(q_0, q_1 = q_3, q_{d+1} = q_0) = (20, 7, 20)$. This shallow BN network is shown in Fig. 7.36 (middle).
In a first step we fit this simpler network to the data $Y$. This gives us the
preliminary estimates for the network weights $w_1^{(1)}, \ldots, w_{q_1}^{(1)}$ and $w_1^{(4)}, \ldots, w_{q_4}^{(4)}$
of the full BN network. From this fitted shallow BN network we receive the
learned representations $z_i = z^{(1)}(y_i) \in \mathbb{R}^{q_1}$, $1 \le i \le n$, in the central layer
using the preliminary estimates of the network weights.

2. In the second step we use the learned representations $z_i \in \mathbb{R}^{q_1}$, $1 \le i \le n$, to
fit the inner part of the original network (using a suitable dissimilarity function).
This inner part is a shallow network with neurons $(q_1, q_2, q_3 = q_1) = (7, 2, 7)$,
see Fig. 7.36 (rhs). This second step gives us the preliminary estimates for the
network weights $w_1^{(2)}, \ldots, w_{q_2}^{(2)}$ and $w_1^{(3)}, \ldots, w_{q_3}^{(3)}$ of the full BN network.

3. In the final step we fit the full BN network on the data $Y$ and use the preliminary
estimates of the weights (of the previous two steps) as initialization of the
gradient descent algorithm.
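The three steps above could be organized as follows in the R library keras. This is only a sketch under several assumptions: the data `Y` is simulated, the helper `dense()`, the layer names, the optimizer, the number of epochs and the choice of the mean squared error in all three steps are illustrative and not taken from the text, and the weight transfer relies on all layers being fitted without intercepts.

```r
library(keras)

n <- 200; q0 <- 20; q1 <- 7; q2 <- 2
Y <- matrix(rnorm(n * q0), n, q0)     # placeholder data (assumption)

# FN layer without intercepts (hypothetical helper)
dense <- function(x, units, act, name)
  layer_dense(x, units = units, activation = act, use_bias = FALSE, name = name)

# Step 1: shallow outer network (q0, q1, q0) fitted on Y
in1  <- layer_input(shape = c(q0))
out1 <- in1 %>% dense(q1, "tanh", "outer_enc") %>% dense(q0, "linear", "outer_dec")
outer_net <- keras_model(inputs = in1, outputs = out1)
outer_net %>% compile(loss = "mean_squared_error", optimizer = "adam")
outer_net %>% fit(Y, Y, epochs = 50, batch_size = 32, verbose = 0)
# learned representations z_i in the central layer of the shallow network
Z <- keras_model(inputs = in1,
                 outputs = get_layer(outer_net, "outer_enc")$output) %>% predict(Y)

# Step 2: shallow inner network (q1, q2, q1) fitted on the representations Z
in2  <- layer_input(shape = c(q1))
out2 <- in2 %>% dense(q2, "tanh", "inner_enc") %>% dense(q1, "tanh", "inner_dec")
inner_net <- keras_model(inputs = in2, outputs = out2)
inner_net %>% compile(loss = "mean_squared_error", optimizer = "adam")
inner_net %>% fit(Z, Z, epochs = 50, batch_size = 32, verbose = 0)

# Step 3: full network (q0, q1, q2, q1, q0), initialized with steps 1 and 2
in3  <- layer_input(shape = c(q0))
out3 <- in3 %>% dense(q1, "tanh", "layer1") %>% dense(q2, "tanh", "layer2") %>%
  dense(q1, "tanh", "layer3") %>% dense(q0, "linear", "layer4")
full_net <- keras_model(inputs = in3, outputs = out3)
w_outer <- get_weights(outer_net)     # list(W^(1), W^(4))
w_inner <- get_weights(inner_net)     # list(W^(2), W^(3))
set_weights(full_net, list(w_outer[[1]], w_inner[[1]], w_inner[[2]], w_outer[[2]]))
full_net %>% compile(loss = "mean_squared_error", optimizer = "adam")
full_net %>% fit(Y, Y, epochs = 50, batch_size = 32, verbose = 0)
```

The inner network uses a bounded output activation here because its targets `Z` are hyperbolic tangent activations; this is one possible reading of "a suitable dissimilarity function" and other choices are conceivable.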


Example 7.31 (BN Network Mortality Model) We apply this BN network approach 
to modify the LC model of Sect. 7.5.4. Hainaut [178] considered such a BN network 
application. For computational reasons, Hainaut [178] proposed a calibration 
strategy different from Hinton—Salakhutdinov [186]. We use this latter calibration 
strategy as it has turned out to work well in our setting. 

As BN network architecture we choose a FN network of depth $d = 3$. The input
and output dimensions are equal to $q_0 = q_4 = 67$; this exactly corresponds to
the number of available calendar years $1950 \le t \le 2016$, see Fig. 7.32. Then, we
select a symmetric architecture around the central FN layer $m = 2$ with $q_1 = q_3 = 20$
neurons. That is, in a first step, the 67 calendar years are compressed to a
20-dimensional representation. For the bottleneck we then explore different numbers
of neurons $q_2 = p \in \{1, \ldots, 20\}$. These BN networks are implemented and fitted in
R with the library keras [77]. We have fitted these models separately to the Swiss 
female and male populations. The raw log-mortality rates are illustrated in Fig. 7.32, 
and for comparability with the LC approach we have centered these log-mortality 
rates according to (7.64), and we use the squared Euclidean distance as the objective 
function. 

Figure 7.37 compares the squared Frobenius reconstruction errors of the linear
LC approximations $Y_p$ to their non-linear BN network counterparts with bottlenecks
$q_2 = p$. We observe that the BN figures are clearly smaller, saying that a
non-linear auto-encoding provides a better reconstruction; this is true, in particular,
for $2 < q_2 < 20$. For $q_2 > 20$ the learning with the BN networks seems saturated;
note that the outer layers have $q_1 = q_3 = 20$ neurons, which limits the learning at
the bottleneck for bigger $q_2$. In view of Fig. 7.37 there seems to be a kink at $q_2 = 4$,
and an "elbow" criterion says that this is the critical bottleneck size that should not
be exceeded.

Fig. 7.37 Frobenius norm reconstruction error

Fig. 7.38 BN network $(q_1, q_2, q_3) = (20, 2, 20)$ fitted log-mortality rates $\log(\widehat{\mu}_{x,t})$ for the
calendar years $1950 \le t \le 2016$ and the ages $x_0 = 0 \le x \le x_1 = 99$ of Swiss females
(left) and Swiss males (right); the plots use the same color scale as Fig. 7.32

The resulting estimated log-mortality surfaces for the bottleneck q2 = 2 are 
illustrated in Fig. 7.38. These strongly resemble the raw log-mortality rates in
Fig. 7.32; in particular, for the male population we get a better fit for ages
$20 \le x \le 40$ from 1980 to 2000 compared to the LC model. In a further analysis we
should check whether this BN network does not over-fit to the data. We could, e.g., 
explore drop-outs during calibration or smaller FN (compression) layers q1 = q3. 

Finally, we analyze the resulting activations at the bottleneck by considering the
BN encoder (7.66). Note that we assume $y \in \mathbb{R}^q$ in (7.66) with $q = |\mathcal{T}|$ being
the rank of the data matrix $Y \in \mathbb{R}^{n \times q}$. Thus, the encoder takes a fixed age
$0 \le x \le 99$ and encodes the corresponding time-series observation $y_x \in \mathbb{R}^{|\mathcal{T}|}$ by the
bottleneck activations. This parametrization has been inspired by the PCA which
typically considers a data matrix that has more rows than columns. This results in
at most $q = \operatorname{rank}(Y)$ singular values, supposing $n \ge q$. However, we can easily
exchange the role of rows and columns, e.g., by transposing all matrices involved.
For mortality forecasting it is advantageous to exchange these roles because we
would like to extrapolate a time-series beyond $\mathcal{T}$. For this reason we set for the input
dimension $q_0 = q = 100$, which provides us with $|\mathcal{T}|$ observations $y_t \in \mathbb{R}^{100}$. We
then fit the BN encoder (7.66) to receive the bottleneck activations

$$Y = (y_t)_{t \in \mathcal{T}} \ \mapsto\ \Phi(Y) = \left(\Phi(y_t)\right)_{t \in \mathcal{T}} \in \mathbb{R}^{q_2 \times |\mathcal{T}|}.$$
Figure 7.39 shows these figures for a bottleneck $q_2 = 2$. We observe that these
bottleneck time-series $(\Phi(y_t))_{t \in \mathcal{T}}$ are much more difficult to understand than the
LC/RH ones given in Fig. 7.35. Firstly, we see that we have quite some dependence
between the components of the time-series. Secondly, in contrast to the LC/RH case
of Fig. 7.35, there is not one component that dominates. Note that this dominance
has been obtained by scaling the components of $(\widehat{b}_x)_x$ to add up to 1 (which,
of course, reflects the magnitudes of the singular values). In the non-linear case,
these scales are hidden in the decoder, from which they are more difficult to extract. Thirdly,
the extrapolation may not work if the time-series has a trend and if we use the
hyperbolic tangent activation function that has a bounded range. In general, a trend
extrapolation has to be considered very carefully with FN networks with non-linear
activation functions, and often there is no good solution to this problem within
the FN network framework. We conclude that this approach improves in-sample
mortality surface modeling, but it leaves open the question about forecasting the
future mortality rates because an extrapolation seems more difficult.

Fig. 7.39 BN network $(q_1, q_2, q_3) = (20, 2, 20)$: bottleneck activations showing $\Phi(y_t) \in \mathbb{R}^2$ for
$t \in \mathcal{T}$


Remark 7.32 The concept of BN networks has also been considered in the actuarial 
literature to encode geographic information, see Blier-Wong et al. [39]. Since 
geographic information has a natural spatial component, these authors propose 
to use a convolutional neural network to encode the spatial information before 
processing the learned features through a BN network. The proposed decoder may 
have different forms, either it tries to reconstruct the whole (spatial) neighborhood 
of a given location or it only tries to reconstruct the site of a given location. 
7.6 Model-Agnostic Tools


We collect some model-agnostic tools in this section that help us to better understand 
and analyze the networks, their calibrations and predictions. Model-agnostic tools 
are techniques that are not specific to a certain model type and can be used for 
any regression model. Most methods presented here are nicely described in the
tutorial of Lorentzen–Mayer [258]. There are several ways of getting a better
understanding of a regression model. First, we can analyze variable importance
which tries to answer similar questions to the GLM variable selection tools
of Sect. 5.3 on model validation. However, in general, we cannot rely on any
asymptotic likelihood theory for such an analysis. Second, we can try to understand 
the predictive model. For a GLM with the log-link function this is quite simple 
because the systematic effects are of a multiplicative nature. For networks this 
is much more complicated because we allow for much more general regression 
functions. We can either try to understand these functions on a global portfolio level 
(by averaging the effects over many insurance policies) or we can try to understand 
these functions locally for individual insurance policies. The latter refers to local
sensitivities around a chosen feature value $x \in \mathcal{X}$, and the former to global
model-agnostics.


7.6.1 Variable Permutation Importance 


For GLMs we have studied the LRT and the Wald test that have been assisting us 
in reducing the GLM by the feature components that do not contribute sufficiently 
to the regression task at hand, see Sects. 5.3.2 and 5.3.3. These variable reduction 
techniques rely on an asymptotic likelihood theory. Here, we need to proceed 
differently, and we just aim at ranking the variables by their importance, similarly 
to a drop1 analysis, see Listing 5.6. 

For a given FN network regression model

$$x \in \mathcal{X} \ \mapsto\ \mu(x) = g^{-1}\langle \beta, z^{(d:1)}(x) \rangle,$$

we randomize one component of $x = (x_1, \ldots, x_q)^\top$ at a time, and we study the
resulting change in the objective function. More precisely, for given (learning) data
$\mathcal{L}$, with features $x_1, \ldots, x_n$, we select one feature component $1 \le j \le q$ and
permute $(x_{i,j})_{1 \le i \le n}$ randomly across the entire portfolio $1 \le i \le n$. We denote by
$\mathcal{L}^{(j)}$ the resulting data with the $j$-th component being permuted. We then compare
the resulting deviance loss $D(\mathcal{L}^{(j)}, \mu)$ to the one $D(\mathcal{L}, \mu)$ on the original data $\mathcal{L}$
using the same regression model $\mu$. We call this approach variable permutation
importance (VPI). Note that such a permutation does not only act on the marginal
effects, but it also distorts the interaction effects of the different feature components.
Fig. 7.40 Variable permutation importance (VPI)

We calculate the VPI on the MTPL claim frequency data of model Poisson
GLM3 of Table 5.5 and the FN network regression model $\widehat{\mu}^{m=1}$ of Table 7.9; we
use this example throughout this section on model-agnostic tools. Figure 7.40 shows
the relative increases

$$\mathrm{vpi}^{(j)} = \frac{D(\mathcal{L}^{(j)}, \mu) - D(\mathcal{L}, \mu)}{D(\mathcal{L}, \mu)}$$

of the deviance losses by permuting one feature component $1 \le j \le q$ at a time.
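The VPI computation is model-agnostic, so it can be sketched with any fitted regression model. The following base R illustration assumes a simulated Poisson portfolio and a small GLM (not the MTPL data or the models of the text); the function names `pois_dev()` and `vpi()` and the feature names `x1`, `x2`, `x3` are hypothetical.

```r
set.seed(5)
n   <- 5000
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))  # x3 is pure noise
dat$N <- rpois(n, lambda = exp(-2 + 1.5 * dat$x1 + 0.3 * dat$x2))

fit <- glm(N ~ x1 + x2 + x3, family = poisson(), data = dat)

# average Poisson deviance loss of the predictions mu for the observations y
pois_dev <- function(y, mu) {
  2 * mean(mu - y + ifelse(y > 0, y * log(y / mu), 0))
}

# relative increase of the deviance loss when permuting one component at a time
vpi <- function(model, data, y, vars) {
  base <- pois_dev(y, predict(model, newdata = data, type = "response"))
  sapply(vars, function(v) {
    perm <- data
    perm[[v]] <- sample(perm[[v]])   # permute component j across the portfolio
    (pois_dev(y, predict(model, newdata = perm, type = "response")) - base) / base
  })
}

vpi_hat <- vpi(fit, dat, dat$N, c("x1", "x2", "x3"))
```

In this toy setting `x1` carries the strongest signal and therefore receives the largest VPI. Since the result depends on the random permutation, one may want to average over several permutations in practice.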

Obviously, the BonusMalus level followed by DrivAge and VehBrand are
the most important variables according to this VPI method. Both models agree
on this ranking. Thereafter, there are smaller disagreements between the two models.
These disagreements may (also) be caused by a non-optimal feature pre-processing 
in the GLM where, for instance, we have to add the interaction effects manually, 
see (5.35). Overall, these VPI results are in line with the findings of the classical 
methods on GLMs, see for instance the drop1 table in Listing 5.6. 

One point that is worth mentioning (and which makes the VPI results not fully 
reliable) is the use of feature components that are highly correlated. In our case, 
Density and Area are highly correlated, see Fig. 13.12. Therefore, it may not 
make sense to randomly permute one component while keeping the other one 
unchanged. This issue will also arise in other methods described below. 


Remark 7.33 (Global Surrogate Model) There are other machine learning methods 
that offer different measures of variable importance. For instance, (binary split) 
classification and regression trees (CARTs) offer popular methods for measuring 
variable importance; for binary split CARTs we refer to Breiman et al. [54] 
and Denuit et al. [100]. These CARTs select individual feature components for 
partitioning the feature space X, and variable importance is measured by analyzing 
the contribution of each feature component to the total decrease of the objective
function. Binary split CARTs have the advantage that this can be done in an additive
way.

More complex regression models like FN networks can then be analyzed by using 
a binary split regression tree as a global surrogate model. That is, we can fit a CART 
to the network regression function (as a surrogate model) and then analyze variable 
importance in this surrogate regression tree model using the tools of regression trees. 
We will not give an explicit example here because we have not formally introduced 
regression trees in this manuscript, but this concept is fairly straightforward and 
well-understood. 


7.6.2 Partial Dependence Plots 


There are several graphical tools that study the individual behavior in the feature 
components. Some of these tools select individual insurance policies and others 
study global portfolio properties. They have in common that they are based on 
marginal considerations, i.e., some sort of projection. 


Individual Conditional Expectation 


Individual conditional expectation (ICE) selects individual insurance policies
$(Y_i, x_i, v_i)$ and varies the feature components of $x_i$ over their entire domain;
we refer to Goldstein et al. [164]. Similarly to the VPI of Sect. 7.6.1, ICE does
not respect collinearity in feature components, but it is rather an isolated view of
individual components.

In Fig. 7.41 we provide the ICE plots of model Poisson GLM3 of Table 5.5 and
the FN network regression model $\widehat{\mu}^{m=1}$ of Table 7.9 for 100 randomly selected
insurance policies $x_i$. For these randomly selected insurance policies we let the
variable DrivAge vary over its domain $\{18, \ldots, 90\}$. Each color corresponds to
one insurance policy $i$, and the colors in the two plots coincide. In the GLM
we observe that the lines are roughly parallel, which reflects that we have an
additive regression structure on the canonical scale (note that these plots are on the
canonical parameter scale). The lines are not perfectly parallel because we allow
for an interaction between DrivAge and BonusMalus in model Poisson GLM3,
see (5.35). The plot of the FN network is more difficult to interpret. Overall the
levels (colors) coincide in the two plots, but in the FN network plot the lines are not
increasing for ages approaching 18; the reason for this is that we have interactions
with other feature components that are important. In particular, for ages close to
18 we cannot have a BonusMalus level of 50% and, therefore, the FN network
cannot be trained on this part of the feature space. Nevertheless, the ICE plot allows
for such feature configurations (by just extrapolating the FN network regression
function beyond the set of available insurance policies). This difficulty is confirmed
by exploiting the same plot only on insurance policies that have a BonusMalus
level of at least 100%. In that case the lines for small ages are non-decreasing when
approaching the age of 18, thus, providing a more reasonable interpretation. We
conclude that if we have strong dependence and/or interactions between the feature
components this method may not provide any reasonable interpretations.

Fig. 7.41 ICE plots of 100 randomly selected insurance policies $x_i$ of (lhs) model Poisson GLM3
and (rhs) FN network $\widehat{\mu}^{m=1}$ letting the variable DrivAge vary over its domain; the y-axis is on
the canonical parameter scale


Partial Dependence Plot 


Partial dependence plots (PDPs) have been introduced by Friedman [141], see also
Zhao–Hastie [405]. PDPs are closely related to the do-operator in causal inference
in statistics; we refer to Pearl [298] and Pearl et al. [299] for the do-operator. A
PDP and the do-operator, respectively, are obtained by breaking the dependence
structure between different feature components. Namely, we decompose the feature
$x = (x_j, x_{\setminus j})$ into two parts with $x_{\setminus j}$ denoting all feature components except
component $x_j$; we will use a slight abuse of notation because the components need
to be permuted correspondingly in the following regression function
$x \mapsto \mu(x) = \mu(x_j, x_{\setminus j})$. Since, typically, there is dependence between $x_j$ and $x_{\setminus j}$ one can infer
$x_{\setminus j}$ from $x_j$, and vice versa. A PDP breaks this inference potential so that the
sensitivity can be studied purely in $x_j$. In particular, the partial dependence profile
is obtained by

$$x_j \ \mapsto\ \bar{\mu}_j(x_j) = \int \mu(x_j, x_{\setminus j})\, dp(x_{\setminus j}), \tag{7.67}$$
where $p(x_{\setminus j})$ is the marginal (portfolio) distribution of the feature components $x_{\setminus j}$.
Observe that this differs from the conditional expectation which reads as

$$x_j \ \mapsto\ \mu_j(x_j) = \mathbb{E}_p\left[\mu(x_j, X_{\setminus j}) \,\middle|\, x_j\right] = \int \mu(x_j, x_{\setminus j})\, dp(x_{\setminus j} | x_j),$$

the latter allowing for inferring $x_{\setminus j}$ from $x_j$ through the conditional probability
$dp(x_{\setminus j} | x_j)$.
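The difference between integrating against the marginal distribution and against the conditional distribution can be seen in a small simulated example. The following base R sketch assumes a toy regression function `mu()` and two dependent feature components; both are assumptions for illustration only, and the integrals are replaced by empirical means.

```r
set.seed(9)
n  <- 20000
x1 <- sample(1:5, n, replace = TRUE)  # component x_j (discrete toy feature)
x2 <- x1 + rnorm(n)                   # remaining component; can be inferred from x1
mu <- function(x1, x2) x1 + x2        # toy regression function (assumption)

lev <- 1:5
# partial dependence profile: average over the marginal distribution of x2
pdp  <- sapply(lev, function(v) mean(mu(v, x2)))
# conditional expectation: average over the distribution of x2 given x1 = v
cond <- sapply(lev, function(v) mean(mu(v, x2[x1 == v])))
```

The partial dependence profile has slope 1 in `x1`, i.e., it isolates the direct effect of `x1` in `mu()`, whereas the conditional expectation has a slope of roughly 2 because it also picks up the information that `x1` carries about `x2`.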


Remark 7.34 (Discrimination-Free Insurance Pricing) Recent actuarial literature 
discusses discrimination-free insurance pricing which aims at developing a pricing 
framework that is free of discrimination w.r.t. so-called protected characteristics 
such as gender and ethnicity; we refer to Guillén [174], Chen et al. [69, 70], 
Lindholm et al. [253] and Frees—Huang [136] for discussions on discrimination 
in insurance. In general, part of the problem also lies in the fact that one can 
often infer the protected characteristics from the non-protected feature information. 
This is called indirect discrimination or proxy discrimination. The proposal of 
Lindholm et al. [253] for achieving discrimination-free prices exactly follows the 
construction (7.67), by breaking the link, which infers the protected characteristics 
from the non-protected ones. 


The partial dependence profile on our portfolio $\mathcal{L}$ with given features $x_1, \ldots, x_n$
is now obtained by just using the portfolio distribution as an empirical distribution
for $p$ in (7.67). That is, for a selected component $x_j$ of $x$, we consider the partial
dependence profile

$$x_j \ \mapsto\ \bar{\mu}_j(x_j) = \frac{1}{n} \sum_{i=1}^n \mu(x_j, x_{i, \setminus j}) = \frac{1}{n} \sum_{i=1}^n \mu\left(x_{i,1}, \ldots, x_{i,j-1}, x_j, x_{i,j+1}, \ldots, x_{i,q}\right),$$

thus, we average the ICE plots over $x_{i, \setminus j}$ of our portfolio $1 \le i \le n$.
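The relation between ICE curves and the empirical partial dependence profile can be sketched in base R. The portfolio below is simulated and the GLM is a deliberately simple additive model; the variable names `age` and `bm` only mimic DrivAge and BonusMalus and are assumptions for this illustration.

```r
set.seed(6)
n   <- 2000
age <- sample(18:90, n, replace = TRUE)
# toy bonus-malus level that depends on age (assumption)
bm  <- pmin(150, pmax(50, round(150 - 3 * (age - 18) + rnorm(n, sd = 15))))
dat <- data.frame(age = age, bm = bm)
dat$N <- rpois(n, exp(-3 + 0.01 * dat$bm - 0.005 * dat$age))
fit <- glm(N ~ age + bm, family = poisson(), data = dat)

grid <- 18:90
# ICE: one row per policy i, one column per grid value x_j (canonical scale)
ice <- sapply(grid, function(a) {
  tmp <- dat
  tmp$age <- a                        # set component j to x_j for all policies
  predict(fit, newdata = tmp, type = "link")
})
pdp <- colMeans(ice)                  # empirical partial dependence profile
```

Because this toy GLM is additive on the canonical scale, all ICE curves are exactly parallel and the profile is linear in `age`. Note that the grid also evaluates combinations such as a low `age` with a low `bm` that hardly occur in the simulated portfolio.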
Figure 7.42 (lhs, middle) gives the PDPs of the variables BonusMalus and
DrivAge of model Poisson GLM3 and the FN network $\widehat{\mu}^{m=1}$. Overall they
look reasonable. However, we are again facing the difficulty that these partial
dependence profiles consider feature configurations that should not appear in our
portfolio. Roughly 57% of all insurance policies have a bonus-malus level of 50%,
which means that these drivers did not suffer any claims in the past couple of
years. Obviously a driver of age 18 cannot be on this bonus-malus level, simply
because she/he is not in a state where she/he can have multiple years of driving
experience without an accident. However, the PDP does not respect this fact, and just
extrapolates the regression function into that part of the feature space. Therefore, the
PDP at driver's age 18 is based on 57% of the insurance policies being on a
bonus-malus level of 50% because this corresponds to the empirical portfolio distribution
$p(x_{\setminus j})$ excluding the driver's age $x_j$ = DrivAge information. Figure 7.42 (rhs)
shows the ratio of insurance policies that have a bonus-malus level of 50%. We
observe that this ratio is roughly zero up to age 28 (orange vertical dotted line),
which indicates that a driver needs 10 successive accident-free years to reach the
lowest bonus-malus level (starting from 100%). We consider it to be a data error that
this ratio below age 28 is not identically equal to zero. We conclude that these PDPs
need to be interpreted very carefully because the insurance portfolio is not uniformly
distributed across the feature space. In some parts of the feature space the regression
function $x \mapsto \mu(x)$ may not even be well-defined because certain combinations of
feature values $x$ may not exist (e.g., a driver of age 18 on bonus-malus level 50% or
a boy at a girl's college).

Fig. 7.42 PDPs of (lhs) BonusMalus level and (middle) DrivAge; the y-axis is on the
canonical parameter scale; (rhs) ratio of policies with a bonus-malus level of 50% per driver's
age


Accumulated Local Effects Profile 


PDPs have the problem that they do not respect the dependencies between the
feature components, as explained in the previous paragraphs. The accumulated local
effects (ALE) profile tries to take account of these dependencies by only studying
a local feature perturbation; we refer to Apley–Zhu [13]. We present a smooth
(gradient-based) version of ALE because our regression functions are differentiable.
Consider the local effect in the individual feature $x$ w.r.t. the component $x_j$ by
studying the partial derivative

$$\mu_j(x) = \frac{\partial \mu(x)}{\partial x_j}. \tag{7.68}$$

The average local effect of component $j$ is received by

$$x_j \ \mapsto\ \Delta_j(x_j; \mu) = \int \mu_j(x_j, x_{\setminus j})\, dp(x_{\setminus j} | x_j). \tag{7.69}$$

ALE integrate the average local effects $\Delta_j(\cdot)$ over their domain, and the ALE profile
is defined by

$$x_j \ \mapsto\ \int_{x_{j_0}}^{x_j} \Delta_j(z_j; \mu)\, dz_j = \int_{x_{j_0}}^{x_j} \int \mu_j(z_j, x_{\setminus j})\, dp(x_{\setminus j} | z_j)\, dz_j, \tag{7.70}$$

where $x_{j_0}$ is a given initialization point. The difference between PDPs and ALE
is that the latter correctly considers the dependence structure between $x_j$ and $x_{\setminus j}$,
see (7.69).


Listing 7.10 Local effects through the gradients of FN networks in keras [77]

    Input = layer_input(shape = c(11), dtype = 'float32', name = 'Design')
    # FN network of depth d = 3 with a linear output (canonical scale)
    Output = Input %>%
        layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
        layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
        layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
        layer_dense(units=1, activation='linear', name='Network')
    #
    model = keras_model(inputs = c(Input), outputs = c(Output))
    # gradient of the network output w.r.t. its inputs
    grad = Output %>%
        layer_lambda(function(x) k_gradients(model$outputs, model$inputs))
    model.grad = keras_model(inputs = c(Input), outputs = c(grad))
    theta.grad <- data.frame(model.grad %>% predict(XX))


Example We come back to our MTPL claim frequency FN network example. The
local effects (7.68) can directly be calculated in the R library keras [77] for a FN
network, see Listing 7.10. In order to do so we need to drop the embedding layers,
compared to Listing 7.4, and directly work on the learned embeddings. This gives
an input layer of dimension $q = 7 + 2 + 2 = 11$ because we have two categorical
features that have been embedded into 2-dimensional Euclidean spaces $\mathbb{R}^2$. Then,
we can formally calculate the gradient of the FN network w.r.t. its inputs which is
done on lines 11-13 of Listing 7.10. Remark that we work on the canonical scale
because we use the linear activation function on line 7 of the listing.


There remain the averaging (7.69) and the integration (7.70) which can be done
empirically

$$x_j \ \mapsto\ \widehat{\Delta}_j(x_j; \mu) = \frac{1}{|\mathcal{E}(x_j)|} \sum_{i \in \mathcal{E}(x_j)} \mu_j(x_i), \tag{7.71}$$

where $\mathcal{E}(x_j)$ denotes the indices $i$ of all cases $x_i$, $1 \le i \le n$, with $x_{i,j} = x_j$,
assuming discrete feature data observations. Note that this empirical
averaging respects the dependence within $x$. The (uncentered) ALE profile is then
obtained by aggregating these local effects, that is,

$$x_j \ \mapsto\ \int_{x_{j_0}}^{x_j} \widehat{\Delta}_j(z_j; \mu)\, dz_j,$$

where this integration is typically understood in a discrete sense because the
observed feature components $x_{i,j}$ are discrete. Often, this uncentered ALE profile is
still translated (centered) by the portfolio average.
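The empirical averaging (7.71) and the discrete integration can be sketched in base R. The sketch assumes a toy regression function `mu()` and simulated, dependent features; the local effects are obtained by central finite differences rather than by the network gradients of Listing 7.10, and the left Riemann sum is only one possible way to understand the integration in a discrete sense.

```r
set.seed(7)
n  <- 5000
x1 <- sample(1:10, n, replace = TRUE)             # discrete component x_j
x2 <- x1 + rnorm(n)                               # depends on x1
mu <- function(x1, x2) 0.1 * x1 + 0.05 * x1 * x2  # toy regression function (assumption)

# local effects (7.68) by central finite differences in the first component
eps  <- 1e-4
mu_1 <- (mu(x1 + eps, x2) - mu(x1 - eps, x2)) / (2 * eps)

# average local effects (7.71): mean over all cases i with x_{i,1} = x_1
lev   <- sort(unique(x1))
delta <- as.numeric(tapply(mu_1, factor(x1, levels = lev), mean))

# uncentered ALE profile: discrete integration starting in lev[1]
ale <- c(0, cumsum(delta[-length(lev)] * diff(lev)))
# centering by the portfolio average
ale_centered <- ale - mean(ale[match(x1, lev)])
```

In this toy example the exact local effect is $0.1 + 0.05\, x_2$, and averaging it over the cases with a given value of `x1` gives roughly $0.1 + 0.05\, x_1$ because `x2` is centered around `x1`.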


Remarks 7.35 


• We have only introduced ALE for continuous feature variables. For nominal
categorical feature components it is not immediately clear how to reasonably
integrate the average local effects $\Delta_j(x_j; \mu)$, and one typically analyzes
these average local effects directly.

• For GLMs the ALEs are rather simple if we work on the canonical scale and
under the canonical link, since

$$\frac{\partial \theta(x)}{\partial x_j} = \beta_j.$$

In the case of model Poisson GLM3 presented in Sect. 5.3.4 the situation is
more delicate as we model the interactions in the GLM as follows, see (5.34)
and (5.35),

$$\begin{aligned}
(\text{DrivAge}, \text{BonusMalus}) \mapsto\; & \beta_l\, \text{DrivAge} + \beta_{l+1} \log(\text{DrivAge}) + \sum_{j=2}^{4} \beta_{l+j}\, (\text{DrivAge})^j \\
& + \beta_{l+5}\, \text{BonusMalus} + \beta_{l+6}\, \text{BonusMalus} \cdot \text{DrivAge} \\
& + \beta_{l+7}\, \text{BonusMalus} \cdot (\text{DrivAge})^2.
\end{aligned}$$

In that case, though we work with a GLM, the resulting local effects are different
from the constant case above if we calculate the derivatives w.r.t. DrivAge and
BonusMalus, respectively, because we explicitly (manually) include non-linear
effects into the GLM.
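To make the last remark concrete, one can differentiate the GLM3 predictor directly. The following is a plain calculus step and not a formula from the source; it assumes the parametrization $\beta_l, \ldots, \beta_{l+7}$ of (5.34) and (5.35), with the last interaction term being quadratic in DrivAge, and writes $\theta$ for the displayed part of the linear predictor:

$$\begin{aligned}
\frac{\partial \theta}{\partial\, \text{DrivAge}} &= \beta_l + \frac{\beta_{l+1}}{\text{DrivAge}} + \sum_{j=2}^{4} j\, \beta_{l+j}\, (\text{DrivAge})^{j-1} + \beta_{l+6}\, \text{BonusMalus} + 2 \beta_{l+7}\, \text{BonusMalus} \cdot \text{DrivAge}, \\
\frac{\partial \theta}{\partial\, \text{BonusMalus}} &= \beta_{l+5} + \beta_{l+6}\, \text{DrivAge} + \beta_{l+7}\, (\text{DrivAge})^2.
\end{aligned}$$

Both local effects depend on the feature values, in contrast to the constant local effect $\beta_j$ of a GLM without manually engineered terms.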


Figure 7.43 shows the ALE profiles of the variables BonusMalus and
DrivAge. The shapes of these profiles can directly be compared to the PDPs
of Fig. 7.42 (the scale on the y-axis should be ignored because this will depend
on the applied centering, however, we hold on to the canonical scale). The main
difference between these two plots can be observed for the variable DrivAge at
low ages. Namely, the ALE profiles have a different shape at low ages because they
respect the dependencies in the feature components by only considering real local
feature configurations.

Fig. 7.43 ALE profiles of (lhs) BonusMalus level and (rhs) DrivAge; the y-axis is on the
log-scale


7.6.3 Interaction Strength 


Next we discuss pairwise interaction strength. Friedman-Popescu [143]
made the following proposal. Roughly speaking, there is an interaction between the
two feature components $x_j$ and $x_k$ of x in the regression function
$x \mapsto \mu(x)$ if

$$\mu_{j,k}(x) = \frac{\partial^2 \mu(x)}{\partial x_j\, \partial x_k} \neq 0. \tag{7.72}$$

This means that the magnitude of a change of the regression function μ(x) in $x_j$
depends on the current value of $x_k$. If there is no such interaction, we can additively
decompose the regression function μ(x) into two independent terms,
$\mu(x) = \mu_{\setminus j}(x_{\setminus j}) + \mu_{\setminus k}(x_{\setminus k})$,
where the first term does not depend on $x_j$ and the second term does not depend
on $x_k$. This motivation is now applied to the PDP profiles given in (7.67). We
define the centered versions $x_j \mapsto \breve{\mu}^j(x_j)$ and
$x_k \mapsto \breve{\mu}^k(x_k)$ of the PDP profiles by centering the PDP profiles
$x_j \mapsto \bar{\mu}^j(x_j)$ and $x_k \mapsto \bar{\mu}^k(x_k)$ over the portfolio
values $x_i$, 1 ≤ i ≤ n. Next, we consider an analogous two-dimensional version for
$(x_j, x_k)$. Let $(x_j, x_k) \mapsto \breve{\mu}^{j,k}(x_j, x_k)$ be the
centered version of a two-dimensional PDP profile
$(x_j, x_k) \mapsto \bar{\mu}^{j,k}(x_j, x_k)$.

Friedman's H-statistic measures the pairwise interaction strength between the
components $x_j$ and $x_k$, and it is defined by

$$H^2_{j,k} = \frac{\sum_{i=1}^n \left( \breve{\mu}^{j,k}(x_{i,j}, x_{i,k}) - \breve{\mu}^j(x_{i,j}) - \breve{\mu}^k(x_{i,k}) \right)^2}{\sum_{i=1}^n \breve{\mu}^{j,k}(x_{i,j}, x_{i,k})^2}, \tag{7.73}$$

we refer to formula (44) in Friedman-Popescu [143]. While $H^2_{j,k}$ measures the
proportion of the joint interaction effect, as we normalize by the variability of
the joint effect $\sum_{i=1}^n \breve{\mu}^{j,k}(x_{i,j}, x_{i,k})^2$, sometimes the
absolute measure is also considered by taking the square root of the numerator
in (7.73). Of course, this can be extended to interactions of three components, etc.;
we refer to Friedman-Popescu [143].

We do not give an example here because calculating Friedman's H-statistic
can be computationally demanding if one has many feature components with many
levels in FN network modeling.
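For a toy regression function the H-statistic can nevertheless be evaluated by brute force. The sketch below is not part of the book's analysis: the helper names (`pdp_1d`, `pdp_2d`, `h_squared`), the small artificial portfolio and the two toy regression functions are assumptions made for illustration only. It evaluates the PDP profiles at the portfolio values, centers them, and forms the ratio of the H-statistic.

```python
from itertools import product

def pdp_1d(mu, X, j, v):
    """Partial dependence of mu at x_j = v, averaged over the portfolio X."""
    return sum(mu(x[:j] + (v,) + x[j + 1:]) for x in X) / len(X)

def pdp_2d(mu, X, j, k, vj, vk):
    """Two-dimensional partial dependence at (x_j, x_k) = (vj, vk)."""
    total = 0.0
    for x in X:
        z = list(x)
        z[j], z[k] = vj, vk
        total += mu(tuple(z))
    return total / len(X)

def h_squared(mu, X, j, k):
    """Squared H-statistic of components j and k, evaluated on the portfolio X."""
    n = len(X)
    pj = [pdp_1d(mu, X, j, x[j]) for x in X]
    pk = [pdp_1d(mu, X, k, x[k]) for x in X]
    pjk = [pdp_2d(mu, X, j, k, x[j], x[k]) for x in X]
    # center each profile over the portfolio values
    cj = [v - sum(pj) / n for v in pj]
    ck = [v - sum(pk) / n for v in pk]
    cjk = [v - sum(pjk) / n for v in pjk]
    num = sum((a - b - c) ** 2 for a, b, c in zip(cjk, cj, ck))
    den = sum(a ** 2 for a in cjk)
    return num / den

# toy portfolio (full factorial design) and two toy regression functions
X = list(product([0.0, 1.0, 2.0], [0.0, 1.0], [-1.0, 1.0]))
additive = lambda x: 1.0 + 2.0 * x[0] - x[1] + 0.5 * x[2]
interact = lambda x: additive(x) + 3.0 * x[0] * x[1]
h_add = h_squared(additive, X, 0, 1)   # no interaction between components 0 and 1
h_int = h_squared(interact, X, 0, 1)   # product term creates an interaction
```

The cost is of order n² evaluations of the regression function per pair of components, which illustrates why the calculation becomes demanding for large portfolios and many pairs.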


7.6.4 Local Model-Agnostic Methods 


The above methods like the PDP and the ALE profile have been analyzing the global 
behavior of the regression functions. We briefly mention some tools that describe the 
local sensitivity and explanation of regression results. 

Probably the most popular method is the locally interpretable model-agnostic
explanation (LIME) introduced by Ribeiro et al. [311]. It analyzes locally the
expected response of a given feature x by perturbing x. In a nutshell, the idea is to
select an environment $\mathcal{E}(x) \subset \mathcal{X}$ of a chosen feature x and to
study the regression function $x' \mapsto \mu(x')$ in this environment
$x' \in \mathcal{E}(x)$. This is done by fitting a (much) simpler surrogate model to μ
on this environment $\mathcal{E}(x)$. If the environment is small, often a linear
regression model is chosen. This then allows one to interpret
the regression function μ(·) locally using the simpler surrogate model, and if we
have a high-dimensional feature space, this linear regression is complemented with
LASSO regularization to only select the most important feature components.
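The local surrogate idea can be illustrated with a minimal sketch. It is not the LIME implementation of Ribeiro et al.: it assumes a small box as environment, uniform perturbations, and an unweighted ordinary least squares fit, and it omits both the proximity kernel and the LASSO step. The names `solve` and `lime_linear` and the toy regression function are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lime_linear(mu, x0, radius=0.1, n_samples=500, seed=1):
    """Fit a linear surrogate to mu on a small box around x0 (intercept first)."""
    rng = random.Random(seed)
    q = len(x0)
    rows, ys = [], []
    for _ in range(n_samples):
        d = [rng.uniform(-radius, radius) for _ in range(q)]
        rows.append([1.0] + d)                    # intercept and centered perturbation
        ys.append(mu([a + b for a, b in zip(x0, d)]))
    # normal equations (Z^T Z) beta = Z^T y of ordinary least squares
    A = [[sum(r[i] * r[j] for r in rows) for j in range(q + 1)] for i in range(q + 1)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(q + 1)]
    return solve(A, b)

# toy regression function with an interaction on the log-scale
mu = lambda x: math.exp(0.3 * x[0] - 0.2 * x[1] + 0.1 * x[0] * x[1])
x0 = [1.0, 2.0]
beta = lime_linear(mu, x0)   # beta[1:], the local slopes, approximate the gradient at x0
```

For a small environment the fitted slopes are close to the partial derivatives of the regression function at the chosen feature, which is what makes the linear surrogate interpretable locally.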

The second method considered in the literature is the Shapley additive explanation
(SHAP). SHAP is based on Shapley values [335], a method
of allocating rewards to players in cooperative games, where a team of individual
players jointly contributes to a potential success. Shapley values solve this allocation
problem under the requirements of additivity and fairness. This concept can be
translated to analyzing how individual feature components of x contribute to the
total prediction μ(x) of a given case. Shapley values allow one to do such a
contribution analysis in the aforementioned additive and fair way, see Lundberg-Lee
[261]. The calculation of SHAP values is combinatorially demanding and therefore
several approximations have been proposed, many of them having their own caveats;
we refer to Aas et al. [1]. We will not consider these further but refer to the relevant
literature.
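The combinatorial nature of Shapley values is easy to see in a brute-force sketch. The following assumes that "absent" feature components are replaced by a fixed baseline value, which is only one of several conventions and is not the conditional expectation version discussed in the SHAP literature; the function name `shapley_values` and the toy regression function are illustrative assumptions.

```python
from itertools import permutations

def shapley_values(mu, x, baseline):
    """Exact Shapley values of mu(x) relative to mu(baseline), by enumerating
    all orderings in which the components are switched from baseline to x."""
    q = len(x)
    phi = [0.0] * q
    perms = list(permutations(range(q)))
    for order in perms:
        z = list(baseline)
        prev = mu(z)
        for j in order:
            z[j] = x[j]              # switch component j from its baseline value to x_j
            cur = mu(z)
            phi[j] += cur - prev     # marginal contribution of j in this ordering
            prev = cur
    return [v / len(perms) for v in phi]

mu = lambda z: 1.0 + 2.0 * z[0] + z[1] * z[2]
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(mu, x, baseline)
```

The contributions add up to the difference between the prediction at x and the prediction at the baseline (additivity), and the product term is shared equally between the two components involved. The q! orderings make this enumeration infeasible beyond a handful of components, hence the approximations mentioned above.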


7.6.5 Marginal Attribution by Conditioning on Quantiles 


The above model-agnostic tools have mainly been studying the sensitivities of the
expected response μ(x) in the feature components of x. This becomes apparent
from considering the partial derivatives (7.68) to calculate the local effects.
Alternatively, we could try to understand how the feature components of x contribute
to a given response μ(x), see Ancona et al. [12]; this section follows Merz et al. [273].
The marginal attribution on an input component j of the response μ(x) can be
studied by the directional derivative

$$x_j \mapsto x_j\, \mu_j(x) = x_j\, \frac{\partial \mu(x)}{\partial x_j}. \tag{7.74}$$

This was first proposed to the data science community by Shrikumar et al. [340].
Basically, it means that we replace the partial derivative $\mu_j(x)$ by the directional
derivative along the vector $x_j e_j = (0, \ldots, 0, x_j, 0, \ldots, 0)^\top \in \mathbb{R}^{q+1}$,

$$\lim_{\epsilon \to 0} \frac{\mu(x + \epsilon x_j e_j) - \mu(x)}{\epsilon}
= \lim_{\epsilon \to 0} \frac{\mu\left((1, x_1, \ldots, x_{j-1}, (1+\epsilon) x_j, x_{j+1}, \ldots, x_q)^\top\right) - \mu(x)}{\epsilon},$$

where $e_j$ is the (j + 1)-st basis vector in $\mathbb{R}^{q+1}$ (index j = 0 corresponds
to the intercept component $x_0 = 1$).
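The identity between the marginal attribution (7.74) and the difference quotient along $x_j e_j$ can be checked numerically. The sketch below uses a hypothetical smooth toy regression function and finite differences; the names `partial` and `directional` and the step sizes are assumptions of this illustration.

```python
import math

def mu(x):
    """Toy smooth regression function; x[0] = 1 is the intercept component."""
    return math.tanh(0.4 * x[0] + 0.7 * x[1] - 0.3 * x[2] + 0.2 * x[1] * x[2])

def partial(mu, x, j, h=1e-6):
    """Central finite-difference approximation of the partial derivative mu_j(x)."""
    up, dn = list(x), list(x)
    up[j] += h
    dn[j] -= h
    return (mu(up) - mu(dn)) / (2 * h)

def directional(mu, x, j, eps=1e-6):
    """Difference quotient of mu when x_j is scaled to (1 + eps) * x_j."""
    z = list(x)
    z[j] = (1 + eps) * x[j]
    return (mu(z) - mu(x)) / eps

x = [1.0, 0.5, -1.2]
attributions = [x[j] * partial(mu, x, j) for j in (1, 2)]   # x_j * mu_j(x)
quotients = [directional(mu, x, j) for j in (1, 2)]         # directional derivative
```

Both lists agree up to the finite-difference error, which reflects that scaling the component $x_j$ by $(1+\epsilon)$ moves x in direction $x_j e_j$.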

We start by recalling the sensitivity analysis of Hong [189] and
Tsanakas-Millossovich [355] in the context of risk measurement. Assume the features have
a portfolio distribution X ∼ p. This describes the random selection of an insurance
policy X = x from the portfolio described by p. The average price over the entire
portfolio is then given by

$$\mu = \mathbb{E}_p[\mu(X)] = \int \mu(x)\, dp(x).$$

We implicitly interpret μ(X) = E[Y|X] as the price of the response Y here,
though we do not need the response distribution in this section. Assume μ(X)
has a continuous distribution function $F_{\mu(X)}$, and we drop the intercept component
$X_0 = x_0 = 1$ from these considerations (but we still keep it in the regression
model). This implies that $U_{\mu(X)} = F_{\mu(X)}(\mu(X))$ is uniformly distributed
on [0, 1]. Choosing a density ζ on [0, 1] gives us a probability distortion
$\zeta(U_{\mu(X)})$ as we have the normalization

$$\mathbb{E}_p\left[\zeta(U_{\mu(X)})\right] = \int_0^1 \zeta(u)\, du = 1.$$

This allows us to define a distorted portfolio price in the sense of a Radon-Nikodym
derivative, namely, we set for the distorted portfolio price

$$\varrho(\mu(X); \zeta) = \mathbb{E}_p\left[\mu(X)\, \zeta(U_{\mu(X)})\right].$$

This functional $\varrho(\mu(X); \zeta)$ is a so-called distortion risk measure. Our goal
is to study the sensitivities of this distortion risk measure in the components of X.
Assume existence of the following directional derivatives for all 1 ≤ j ≤ q

$$S_j(\mu; \zeta) = \left. \frac{\partial}{\partial \epsilon}\, \varrho\left(\mu\left((1, X_1, \ldots, X_{j-1}, (1+\epsilon) X_j, X_{j+1}, \ldots, X_q)^\top\right); \zeta\right) \right|_{\epsilon = 0}.$$

$S_j(\mu; \zeta)$ can be used to describe the sensitivities of the regression function
$X \mapsto \mu(X)$ in the feature components $X_j$. Under different sets of assumptions,
Hong [189] and Tsanakas-Millossovich [355] have proved the following identity

$$S_j(\mu; \zeta) = \mathbb{E}_p\left[X_j\, \mu_j(X)\, \zeta(U_{\mu(X)})\right];$$

the right-hand side exactly uses the marginal attribution (7.74). There remains the
freedom of the choice of the density ζ on [0, 1], which allows us to study the
sensitivities of different distortion risk measures. For the uniform distribution ζ ≡ 1
on [0, 1] we simply have the average (best-estimate) price and its average marginal
attributions

$$\varrho(\mu(X); \zeta \equiv 1) = \mathbb{E}_p[\mu(X)] = \mu \quad \text{and} \quad S_j(\mu; \zeta \equiv 1) = \mathbb{E}_p[X_j\, \mu_j(X)].$$

If we want to consider a quantile risk measure, called value-at-risk (VaR), we choose
a Dirac measure for the density ζ. That is, choose a point measure of mass 1 in
α ∈ (0, 1), i.e., the density ζ is concentrated in the single point α. In that case, the
event $\{F_{\mu(X)}(\mu(X)) = U_{\mu(X)} = \alpha\}$ receives probability one, and
therefore we have the α-quantile

$$\varrho(\mu(X); \alpha) = F^{-1}_{\mu(X)}(\alpha),$$

and the corresponding sensitivities for 1 ≤ j ≤ q

$$S_j(\mu; \alpha) = \mathbb{E}_p\left[X_j\, \mu_j(X) \,\middle|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\right]. \tag{7.75}$$


Remarks 7.36 


• In the introduction to this section we have assumed that μ(X) has a continuous
distribution function. This emphasizes that this sensitivity analysis is most
suitable for continuous feature components. Categorical and discrete feature
components can be embedded into a Euclidean space, e.g., using embedding
layers, and then they can be treated as continuous variables.

• Sensitivities (7.75) respect the local portfolio structure as they are calculated
w.r.t. p.

• In applications, we will work with the empirical portfolio distribution for p
provided by $(x_i)_{1 \le i \le n}$. This gives an empirical approximation to (7.75) and,
in particular, it will require a choice of a bandwidth for the evaluation of the
conditional probability, conditioned on the event
$\{\mu(X) = F^{-1}_{\mu(X)}(\alpha)\}$. This is done with a local smoother similarly to
Listing 7.8.


In analogy to Merz et al. [273] we give a different interpretation to the
sensitivities (7.75), which allows us to further expand this formula. We have the 1st
order Taylor expansion

$$\mu(x + \epsilon) = \mu(x) + (\nabla_x \mu(x))^\top \epsilon + o(\|\epsilon\|_2) \quad \text{for } \|\epsilon\|_2 \to 0.$$

Obviously, this is a local approximation in x. Setting ε = −x, we get the (possibly
crude) approximation

$$\mu(0) \approx \mu(x) - (\nabla_x \mu(x))^\top x.$$

By bringing the gradient term to the other side, using (7.75) and conditionally
averaging, we obtain the 1st order marginal attributions

$$F^{-1}_{\mu(X)}(\alpha) = \mathbb{E}_p\left[\mu(X) \,\middle|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\right] \approx \mu(0) + \sum_{j=1}^q S_j(\mu; \alpha). \tag{7.76}$$

Thus, the sensitivities $S_j(\mu; \alpha)$ provide a 1st order description of the quantiles
$F^{-1}_{\mu(X)}(\alpha)$ of μ(X). We call this approach marginal attribution by
conditioning on quantiles (MACQ) because it shows how the components $X_j$ of X
contribute to a given quantile level.


Example 7.37 (MACQ for Linear Regression) The simplest case is the linear
regression case because the 1st order marginal attributions (7.76) are exact in this
case. Consider a linear regression function with regression parameter
$\beta \in \mathbb{R}^{q+1}$

$$x \mapsto \mu(x) = \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_j.$$

The 1st order marginal attributions for fixed α ∈ (0, 1) are given by

$$F^{-1}_{\mu(X)}(\alpha) = \mu(0) + \sum_{j=1}^q S_j(\mu; \alpha)
= \beta_0 + \sum_{j=1}^q \beta_j\, \mathbb{E}_p\left[X_j \,\middle|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\right]. \tag{7.77}$$

That is, we replace the feature components $X_j$ by their expected contributions on
a given quantile level $F^{-1}_{\mu(X)}(\alpha)$ in (7.77). We compare this explanation
to the ALE profile (7.70). Setting the initial value $x_{j_0} = 0$, the ALE profile for the
linear regression model is given by

$$x_j \mapsto \int_0^{x_j} \Delta_j(z_j; \mu)\, dz_j = \beta_j x_j.$$

This is the sensitivity of the linear regression function in component $x_j$,
whereas (7.77) describes the contribution of each feature component to an expected
response level μ(x); in particular,
$\mathbb{E}_p[X_j \,|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)]$ describes the average
feature value in component j on a given quantile level. ■
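The exactness of the 1st order attributions in the linear case can be verified empirically. The sketch below is an illustration under assumptions: it replaces the local smoother of Listing 7.8 by a simple rectangular window in the ranks of the fitted means, and the function name `macq_first_order`, the bandwidth and the simulated Gaussian features are hypothetical choices.

```python
import random

def macq_first_order(X, grads, mu_vals, alpha, bandwidth=0.02):
    """Empirical 1st order sensitivities: average x_ij * mu_j(x_i) over the cases
    whose rank of mu(x_i) lies within `bandwidth` of the quantile level alpha."""
    n = len(X)
    order = sorted(range(n), key=lambda i: mu_vals[i])
    lo = max(0, int((alpha - bandwidth) * n))
    hi = min(n, int((alpha + bandwidth) * n) + 1)
    window = order[lo:hi]
    q = len(X[0])
    S = [sum(X[i][j] * grads[i][j] for i in window) / len(window) for j in range(q)]
    # local average of the fitted means, used as the empirical quantile
    local_quantile = sum(mu_vals[i] for i in window) / len(window)
    return S, local_quantile

rng = random.Random(0)
beta0, beta = 0.5, [1.0, -2.0, 0.3]
X = [[rng.gauss(0, 1) for _ in beta] for _ in range(2000)]
mu_vals = [beta0 + sum(b * x for b, x in zip(beta, xi)) for xi in X]
grads = [beta] * len(X)          # the gradient of a linear function is beta
S, q_hat = macq_first_order(X, grads, mu_vals, alpha=0.9)
```

Within the window, the intercept plus the sum of the empirical sensitivities reproduces the local average of the fitted means up to rounding, in line with (7.77); for a non-linear regression function this would only hold approximately.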


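A natural next step is to expand the 1st order attributions to 2nd order. This
allows us to consider the interaction terms. Consider the 2nd order Taylor expansion

$$\mu(x + \epsilon) = \mu(x) + (\nabla_x \mu(x))^\top \epsilon + \frac{1}{2}\, \epsilon^\top \nabla_x^2 \mu(x)\, \epsilon + o(\|\epsilon\|_2^2) \quad \text{for } \|\epsilon\|_2 \to 0.$$

Similar to (7.76), setting ε = −x, this gives us the 2nd order marginal attributions

$$\begin{aligned}
F^{-1}_{\mu(X)}(\alpha) &\approx \mu(0) + \sum_{j=1}^q S_j(\mu; \alpha) - \frac{1}{2} \sum_{j,k=1}^q T_{j,k}(\mu; \alpha) \\
&= \mu(0) + \sum_{j=1}^q \left( S_j(\mu; \alpha) - \frac{1}{2}\, T_{j,j}(\mu; \alpha) \right) - \sum_{1 \le j < k \le q} T_{j,k}(\mu; \alpha),
\end{aligned} \tag{7.78}$$

where for 1 ≤ j, k ≤ q we define
$\mu_{j,k}(x) = \partial_{x_j} \partial_{x_k} \mu(x)$, see (7.72), and

$$T_{j,k}(\mu; \alpha) = \mathbb{E}_p\left[X_j X_k\, \mu_{j,k}(X) \,\middle|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\right]. \tag{7.79}$$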
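For a quadratic regression function the 2nd order Taylor expansion has no remainder, so the pointwise identity behind the 2nd order attributions holds exactly before any conditional averaging. The following sketch checks this for a hypothetical quadratic function; all names and parameter values are assumptions of the illustration.

```python
def mu(x, b0, b, A):
    """Quadratic regression function mu(x) = b0 + <b, x> + x^T A x (A symmetric)."""
    q = len(x)
    lin = sum(b[j] * x[j] for j in range(q))
    quad = sum(A[j][k] * x[j] * x[k] for j in range(q) for k in range(q))
    return b0 + lin + quad

def gradient(x, b, A):
    """Gradient b + 2 A x of the quadratic function."""
    q = len(x)
    return [b[j] + 2 * sum(A[j][k] * x[k] for k in range(q)) for j in range(q)]

def hessian(A):
    """Hessian 2 A of the quadratic function."""
    return [[2 * a for a in row] for row in A]

b0, b = 0.2, [1.0, -0.5]
A = [[0.3, 0.1], [0.1, -0.2]]
x = [1.5, -2.0]
g, H = gradient(x, b, A), hessian(A)
first = sum(x[j] * g[j] for j in range(2))                                # sum_j x_j mu_j(x)
second = sum(x[j] * x[k] * H[j][k] for j in range(2) for k in range(2))   # sum_jk x_j x_k mu_jk(x)
approx = mu([0.0, 0.0], b0, b, A) + first - 0.5 * second
exact = mu(x, b0, b, A)
```

Here `approx` and `exact` coincide up to rounding. For a network with hyperbolic tangent activations the same expression is only an approximation, and its quality depends on the reference point.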


Remarks 7.38 


• The first line of (7.78) separates the 1st order attributions from the 2nd order
attributions; the second line splits w.r.t. the individual component j attributions
and the interaction attributions j ≠ k.

• The 1st order attributions (7.75) have been motivated by considering the directional
derivatives of the VaR distortion risk measure. Unfortunately, the 2nd order
consideration has no simple equivalent motivation, as the 2nd order directional
derivatives are much more involved, even in the linear case; we refer to Property
1 in Gourieroux et al. [167].

• Interestingly, we can precisely evaluate the accuracy of approximation (7.78) by
analyzing for a given regression function μ(·)

$$\sup_{\alpha \in (0,1)} \left| F^{-1}_{\mu(X)}(\alpha) - \mu(0) - \sum_{j=1}^q S_j(\mu; \alpha) + \frac{1}{2} \sum_{j,k=1}^q T_{j,k}(\mu; \alpha) \right|. \tag{7.80}$$

Intuitively, in order to have a uniformly good approximation, the origin 0 should be
somehow centered in the feature distribution X ∼ p. This will be studied next.


Above we have implicitly assumed that 0 is a suitable reference point that makes
the approximation error (7.80) small. For FN network fitting we typically normalize
the features either using the MinMaxScaler (7.29) or we center and normalize the
components of $(x_i)_{1 \le i \le n}$ according to (7.30). That is, the reference point is
chosen such that the gradient descent fitting works efficiently. However, this may not be
an optimal reference point for studying the 2nd order attributions. Therefore, we
analyze this question in more detail; the following reparametrization can still be
done after model fitting.

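If we choose an arbitrary translation $a \in \mathbb{R}^q$, we can set ε = a − x in the
above 2nd order Taylor expansion to receive another 2nd order marginal attribution
representation

$$\begin{aligned}
F^{-1}_{\mu(X)}(\alpha) \approx \mu(a) &- \mathbb{E}_p\left[(a - X)^\top \nabla_x \mu(X) \,\middle|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\right] \\
&- \frac{1}{2}\, \mathbb{E}_p\left[(a - X)^\top \nabla_x^2 \mu(X)\, (a - X) \,\middle|\, \mu(X) = F^{-1}_{\mu(X)}(\alpha)\right].
\end{aligned} \tag{7.81}$$

Essentially, this means that we shift the feature distribution p by considering the
shifted random vectors $X^a = X - a$ and setting $\mu^a(\cdot) = \mu(a + \cdot)$;
thus, this simply says that we pre-process the features differently. In view of
approximation (7.81) we can now select a reference point $a \in \mathbb{R}^q$ that makes
the 2nd order marginal attributions as precise as possible. Define the events
$A_l = \{\mu(X) = F^{-1}_{\mu(X)}(\alpha_l)\}$ for a discrete quantile grid
$0 < \alpha_1 < \ldots < \alpha_L < 1$. We define the objective function

$$\begin{aligned}
a \mapsto G(a; \mu) = \sum_{l=1}^L \Big( & F^{-1}_{\mu(X)}(\alpha_l) - \mu(a) + \mathbb{E}_p\left[(a - X)^\top \nabla_x \mu(X) \,\middle|\, A_l\right] \\
& + \frac{1}{2}\, \mathbb{E}_p\left[(a - X)^\top \nabla_x^2 \mu(X)\, (a - X) \,\middle|\, A_l\right] \Big)^2.
\end{aligned} \tag{7.82}$$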

Making this objective function G(a; μ) small in a will provide us with a good
reference point for the selected quantile levels $(\alpha_l)_{1 \le l \le L}$; this is
exactly the MACQ proposal of Merz et al. [273]. A local minimum can be found by
applying a gradient descent algorithm

$$a^{(t)} \mapsto a^{(t+1)} = a^{(t)} - \delta_{t+1} \nabla_a G(a^{(t)}; \mu),$$

for tempered learning rates $\delta_{t+1} > 0$. The gradient of G w.r.t. a is given by

$$\begin{aligned}
\nabla_a G(a; \mu) = 2 \sum_{l=1}^L & \Big( F^{-1}_{\mu(X)}(\alpha_l) - \mu(a) + \mathbb{E}_p\left[(a - X)^\top \nabla_x \mu(X) \,\middle|\, A_l\right]
+ \frac{1}{2}\, \mathbb{E}_p\left[(a - X)^\top \nabla_x^2 \mu(X)\, (a - X) \,\middle|\, A_l\right] \Big) \\
& \times \Big( -\nabla_a \mu(a) + \mathbb{E}_p\left[\nabla_x \mu(X) \,\middle|\, A_l\right]
- \mathbb{E}_p\left[\nabla_x^2 \mu(X)\, X \,\middle|\, A_l\right] + \mathbb{E}_p\left[\nabla_x^2 \mu(X) \,\middle|\, A_l\right] a \Big).
\end{aligned}$$

All subsequent considerations and interpretations are done w.r.t. an optimal
reference point $a \in \mathbb{R}^q$ obtained by minimizing the objective function (7.82)
on the chosen quantile grid. Mathematically speaking, this optimal choice is w.l.o.g.
because the origin 0 of the coordinate system of the feature space $\mathcal{X}$ is
arbitrary, and any other origin can be chosen by a translation, see formula (7.81) and
the subsequent discussion. For interpretations, however, the choice of the reference
point a matters because the directional derivative $X_j \mu_j(X)$ can be small either
because $X_j$ is small or because $\mu_j(X)$ is small. Having a small $X_j$ means that
this feature value is close to the chosen reference point.
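The search for a reference point can be sketched in one dimension. This is a toy illustration and not the implementation of Merz et al.: it assumes a single feature, the regression function μ(x) = exp(x) (so that all derivatives are known in closed form), rectangular rank windows in place of a local smoother, and a simple step-halving rule as one possible way to temper the learning rate. All function names are hypothetical.

```python
import math

mu = d1 = d2 = math.exp            # mu(x) = exp(x) equals its own derivatives

def windows(xs, alphas, half_width=5):
    """Index windows around the empirical quantiles of mu(x_i)."""
    order = sorted(range(len(xs)), key=lambda i: mu(xs[i]))
    out = []
    for al in alphas:
        c = int(al * (len(xs) - 1))
        out.append(order[max(0, c - half_width): c + half_width + 1])
    return out

def mean(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

def objective_and_gradient(a, xs, wins):
    """One-dimensional objective G(a) and its derivative in a."""
    G, dG = 0.0, 0.0
    for w in wins:
        quant = mean(mu(xs[i]) for i in w)                       # empirical quantile
        res = (quant - mu(a)
               + mean((a - xs[i]) * d1(xs[i]) for i in w)
               + 0.5 * mean((a - xs[i]) ** 2 * d2(xs[i]) for i in w))
        dres = (-d1(a) + mean(d1(xs[i]) for i in w)
                + mean((a - xs[i]) * d2(xs[i]) for i in w))
        G += res ** 2
        dG += 2 * res * dres
    return G, dG

xs = [2.0 * i / 200 for i in range(201)]                         # features on [0, 2]
wins = windows(xs, [l / 10 for l in range(1, 10)])               # grid 10%, ..., 90%
a, rate = 0.0, 1.0
G_start, _ = objective_and_gradient(a, xs, wins)
G = G_start
for _ in range(200):
    _, grad = objective_and_gradient(a, xs, wins)
    trial = a - rate * grad
    G_trial, _ = objective_and_gradient(trial, xs, wins)
    if G_trial < G:
        a, G = trial, G_trial                                    # accept the step
    else:
        rate *= 0.5                                              # temper the learning rate
```

Starting from the origin, the descent moves the reference point into the interior of the feature range and reduces the objective, which matches the intuition that the reference point should be centered in the feature distribution.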


Example 7.39 (MACQ Analysis) We revisit the MTPL claim frequency example 
using the FN network regression model of depth d = 3 having (q1, q2,q3) = 
(20, 15, 10) neurons. Importantly, we use the hyperbolic tangent as the activation 
function in the FN layers which provides smoothness of the regression function. 
Figure 7.40 shows the VPI plot of this fitted model. Obviously, the variable 
BonusMalus plays the most important role in this predictive model. Remark that 
the VPI plot does not properly respect the dependence structure in the features as it 
independently permutes each feature component at a time. The aim in this example 
is to determine variable importance by doing the MACQ analysis (7.78). 

Figure 7.44 (lhs) shows the empirical density of the fitted canonical parameter
$\theta(x_i)$, 1 ≤ i ≤ n; all plots in this example refer to the canonical scale. We then
minimize the objective function (7.82), which provides us with an optimal reference
point $a \in \mathbb{R}^q$; we choose the equidistant quantile grid 1% < 2% < ... < 99%,
and all conditional expectations in $\nabla_a G(a; \mu)$ are empirically approximated
by a local smoother similar to Listing 7.8. Figure 7.44 (rhs) gives the resulting marginal
attributions w.r.t. this reference point. The orange line shows the 1st order marginal
attributions (7.76), and the red line the 2nd order marginal attributions (7.78). The
cyan line drops the interaction terms $T_{j,k}(\mu; \alpha)$, j ≠ k, from the 2nd order
marginal attributions. From the shaded cyan area we see the importance of the interaction
terms. We note that the 2nd order marginal attributions (red line) match the true
empirical quantiles (black dots) quite well for the chosen reference point a.

Fig. 7.44 (lhs) Empirical density of the fitted canonical parameter $\theta(x_i)$,
1 ≤ i ≤ n, (rhs) 1st and 2nd order marginal attributions

Figure 7.45 gives the 2nd order marginal attributions
$S_j(\mu; \alpha) - \frac{1}{2} T_{j,j}(\mu; \alpha)$ of the individual components
1 ≤ j ≤ q on the left-hand side, and the interaction terms
$-\frac{1}{2} T_{j,k}(\mu; \alpha)$, j ≠ k, on the right-hand side. We identify the
following components as being important: BonusMalus, DrivAge, VehGas, VehBrand and
Region; these components show a behavior substantially different from being equal to 0,
i.e., these components differentiate from the reference point a. These components also
have major interactions that contribute to the quantiles above the level 80%.

Fig. 7.45 (lhs) Second order marginal attributions
$S_j(\mu; \alpha) - \frac{1}{2} T_{j,j}(\mu; \alpha)$ excluding interaction
terms, and (rhs) interaction terms $-\frac{1}{2} T_{j,k}(\mu; \alpha)$, j ≠ k

If we allocate the interaction terms to the corresponding components 1 ≤ j ≤ q
we receive the second order marginal attributions
$S_j(\mu; \alpha) - \frac{1}{2} \sum_{k=1}^q T_{j,k}(\mu; \alpha)$.
These are illustrated in Fig. 7.46 (lhs) and the quantile slices at the levels α ∈
{20%, 40%, 60%, 80%} are given in Fig. 7.46 (rhs). These graphs illustrate variable
importance on different quantile levels (and respecting the dependence within
the features). In particular, we identify the main variables that distinguish the
given quantile levels from the reference level θ(a), i.e., Fig. 7.46 (rhs) should be
understood as the relative differences to the chosen reference level. Once more we
see that BonusMalus is the main driver, but also other variables contribute to the
differentiation of the high quantile levels.

Fig. 7.46 (lhs) Second order marginal attributions
$S_j(\mu; \alpha) - \frac{1}{2} \sum_{k=1}^q T_{j,k}(\mu; \alpha)$ including
interaction terms, and (rhs) slices at the quantile levels α ∈ {20%, 40%, 60%, 80%}

Figure 7.47 shows the individual attributions $x_{i,j}\, \mu_j(x_i)$ of 1'000 randomly
selected cases $x_i$ for the feature components j = BonusMalus, DrivAge,
VehGas, VehBrand; the colors illustrate the corresponding feature values $x_{i,j}$
of the individual car drivers i, and the black solid line corresponds to
$S_j(\mu; \alpha) - \frac{1}{2} T_{j,j}(\mu; \alpha)$ excluding the interaction terms
(the black dotted line is one empirical standard deviation around the black solid line).
Focusing on the variable BonusMalus we observe that the lower quantiles are almost
completely dominated by insurance policies on the lowest bonus-malus level. The
bonus-malus levels 70-80 provide little sensitivity (are concentrated around the zero
line) because the reference point a reflects these bonus-malus levels, and, finally,
the large quantiles are dominated by high bonus-malus levels (red dots).

The plot of the variable DrivAge is interpreted similarly. The reference point
a is close to the young drivers, therefore, young drivers are concentrated around
the zero line. At the low quantile levels, higher ages contribute positively to the
low expected frequencies, whereas these ages have an unfavorable impact at higher
quantile levels (this should be considered in combination with their bonus-malus
levels). We also observe a few outliers in this plot, for instance, we can identify a
driver of age 20 at a quantile level of 20%. Further inspection of this driver raises
some doubts whether this data is correct since this driver is at a bonus-malus level
of 68% (which should technically not be possible) and she/he has an exposure of 2
days. Surely, this insurance policy would need further investigation.

The plot of VehGas shows that the chosen reference level θ(a) is closer to
Diesel fuel cars as the red dots fluctuate less around the zero line; in different
runs of the gradient descent algorithm (with different seeds) this order has been
changing (as it depends on the reference point a). We skip a detailed analysis of the
variable VehBrand. ■

Fig. 7.47 Individual attributions $x_{i,j}\, \mu_j(x_i)$ of 1'000 randomly selected
cases $x_i$ for j = BonusMalus, DrivAge, VehGas, VehBrand; the plots have different
y-scales

7.7 Lab: Analysis of the Fitted Networks


In the previous section we have studied some model-agnostic tools that can be used
for any (differentiable) regression model. In this section we give some network
specific plots. For simplicity we choose one specific example, namely, the FN
network $\widehat{\mu} = \widehat{\mu}^{m=1}$ of Table 7.9. We start by analyzing the
learned representations in the different FN layers; this links to our introduction in
Sect. 7.1.

For any FN layer 1 ≤ m ≤ d we can study the learned representations
$z^{(m:1)}(x)$. For Fig. 7.48 we select at random 1'000 insurance policies $x_i$, and the
dots show the activations of these insurance policies in neurons j = 4 (x-axis)
and j = 9 (y-axis) in the corresponding FN layers. These neuron activations are
in the interval (−1, 1) because we work with the hyperbolic tangent activation
function for φ. The color scale shows the resulting estimated frequencies
$\widehat{\mu}(x_i)$ of the selected policies. We observe that the layers are increasingly
(in the depth of the network) separating the low frequency policies (light blue-green
colors) from the high frequency policies (red color). This is a quite typical picture
that we obtain here, though this sparsity in the 3rd FN layer is not the case for every
neuron $1 \le j \le q_3$.

In higher dimensional FN architectures it will be difficult to analyze the learned
representations on each individual neuron, but at least one can try to understand
the main effects learned. For this, on the one hand, we can focus on the important
feature components, see, e.g., Sect. 7.6.1, and, on the other hand, we can try to study
the main effects learned using a PCA in each FN layer, see Sect. 7.5.3. Figure 7.49
shows the singular values $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{q_m} \ge 0$ in
each of the three FN layers 1 ≤ m ≤ d = 3; we center the neuron activations to mean zero
before applying the SVD. These plots support the previously made statement that the
layers are increasingly separating the high frequency from the low frequency policies. An
elbow criterion tells us that in the first FN layer we have 8 important principal
components (out of 20), in the second FN layer 3 (out of 15) and in the third FN
layer 1 (out of 10). This is also reflected in Fig. 7.48 where we see more and more
concentration in the neuron activations. It is important to notice that the chosen
FN network calibration $\widehat{\mu}$ does not involve any drop-out layers during the
gradient descent fitting, see Sect. 7.4.1. Drop-out layers prevent individual neurons from
over-training to a specific task. Consequently, we will receive a network calibration
that is more equally balanced across all neurons under drop-outs, because if one neuron
drops out, the composite of the remaining neurons needs to be able to take over the
task of the dropped out neuron. This leads to less sparsity and to singular values that
are more similarly sized.

Fig. 7.48 Observed activations in the three FN layers m = 1, 2, 3 (left-middle-right) in
the corresponding neurons j = 4, 9; the color key shows the estimated frequencies
$\widehat{\mu}(x_i)$

Fig. 7.49 Singular values $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_{q_m} \ge 0$
in the FN layers 1 ≤ m ≤ d = 3
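The centering and SVD step used for these plots can be sketched on toy activations. The following is a hedged illustration only: it computes the singular values through power iteration with deflation on the Gram matrix of the centered activations, which is a simple substitute for a library SVD, and the simulated hyperbolic tangent activations are hypothetical rather than taken from the fitted network.

```python
import math
import random

def centered_gram(Z):
    """Return C = Zc^T Zc for the column-centered activation matrix Zc."""
    n, q = len(Z), len(Z[0])
    means = [sum(row[j] for row in Z) / n for j in range(q)]
    Zc = [[row[j] - means[j] for j in range(q)] for row in Z]
    return [[sum(r[j] * r[k] for r in Zc) for k in range(q)] for j in range(q)]

def singular_values(Z, iters=500):
    """Singular values of the centered activations via power iteration with deflation."""
    C = centered_gram(Z)
    q = len(C)
    values = []
    for _ in range(q):
        v = [1.0 / math.sqrt(q)] * q
        for _ in range(iters):
            w = [sum(C[j][k] * v[k] for k in range(q)) for j in range(q)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:
                break
            v = [c / norm for c in w]
        lam = sum(v[j] * C[j][k] * v[k] for j in range(q) for k in range(q))
        values.append(math.sqrt(max(lam, 0.0)))
        # deflate: remove the eigen-direction just found
        C = [[C[j][k] - lam * v[j] * v[k] for k in range(q)] for j in range(q)]
    return values

# toy activations of three tanh neurons driven by independent signals t, u, w
rng = random.Random(0)
Z = []
for _ in range(400):
    t, u, w = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    Z.append([math.tanh(2.0 * t), math.tanh(0.5 * t + 0.3 * u), math.tanh(0.1 * w)])
sv = singular_values(Z)
```

A fast decay of the resulting singular values indicates that the activations concentrate on few principal components, which is the pattern read off from the elbow criterion above.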

In Fig. 7.50 we analyze the first two principal components in each FN layer,
thus, these are the two principal components that correspond to the two biggest
singular values $(\lambda_1, \lambda_2)$ in each of the three FN layers. The first row
shows the input variables (BonusMalus, DrivAge) ∈ [50, 125] × [18, 90] of the 1'000
randomly selected policies $x_i$; these are the two most important feature components
according to the VPI analysis. All three columns show the same data, however, in
different color scales: (lhs) gives the color scale $\widehat{\mu}$, (middle) gives the
color scale BonusMalus, and (rhs) gives the color scale DrivAge. These color scales also
apply to the other rows. The 2nd row shows the first two principal components in
the 1st FN layer, the 3rd row in the 2nd FN layer, and the last row in the third
FN layer. Focusing on the first column we observe that the layers cluster the high
and the low frequency policies in the 1st principal component more and more
across the FN layers. Not surprisingly this leads to a quite clear-cut separation
w.r.t. the bonus-malus level which can be verified from the second column of
Fig. 7.50. For the driver's age variable this sharp separation gets lost across the
layers, see third column of Fig. 7.50, which indicates that the variable DrivAge
does not influence the frequency monotonically and it interacts with the variable
BonusMalus.

Fig. 7.50 (First row) Input variables (BonusMalus, DrivAge), (Second-fourth row) first two
principal components in FN layers m = 1, 2, 3; (lhs) gives the color scale of estimated
frequency $\widehat{\mu}$, (middle) gives the color scale BonusMalus, and (rhs) gives the
color scale DrivAge

Figure 7.51 shows the second order marginal attributions (7.78) for the different
inputs. The graph on the left-hand side shows the plot w.r.t. the original inputs
$x_i$, the graph in the middle w.r.t. the learned representations
$z^{(1:1)}(x_i) \in \mathbb{R}^{q_1}$ in the first FN layer, and on the right-hand side
w.r.t. the learned representations $z^{(2:1)}(x_i) \in \mathbb{R}^{q_2}$ in the second
FN layer. We interpret these plots as follows: the FN network disentangles the different
effects through the FN layers by making the plots smoother and making the interactions
between the neurons smaller. Note that the learned representations
$z^{(3:1)}(x_i) \in \mathbb{R}^{q_3}$ in the last FN layer go into
a classical GLM for the output layer, which does not have any interactions in the
canonical predictor (because it is additive on the canonical scale), thus, being of
the same type as the linear regression of Example 7.37. In the Poisson model with
the log-link function, the interactions can only be of a multiplicative type in GLMs.
Therefore, the network feature-engineers the input $x_i$ (in an automated way) such
that the learned representation $z^{(3:1)}(x_i)$ in the last FN layer is exactly in this
GLM structure. This is verified by the small interaction part in Fig. 7.51 (rhs). This
closes this part on model-agnostic tools.

Fig. 7.51 Second order marginal attributions: (lhs) w.r.t. the input layer
$x \in \mathbb{R}^{q_0}$, (middle) w.r.t. the first FN layer
$z^{(1:1)}(x) \in \mathbb{R}^{q_1}$, and (rhs) w.r.t. the second FN layer
$z^{(2:1)}(x) \in \mathbb{R}^{q_2}$

Chapter 8
Recurrent Neural Networks


Chapter 7 has discussed fully-connected feed-forward neural (FN) networks.
Feed-forward means that information is passed in a directed acyclic path from the input
layer to the output layer. A natural extension is to allow these networks to have
cycles. In that case, we call the architecture a recurrent neural (RN) network. A RN
network architecture is particularly useful for time-series modeling. The discussion
on time-series data also links to Sect. 5.8.1 on longitudinal and panel data. RN
networks were introduced in the 1980s, and the two most popular RN network
architectures are the long short-term memory (LSTM) architecture proposed by
Hochreiter-Schmidhuber [188] and the gated recurrent unit (GRU) architecture
introduced by Cho et al. [76]. These two architectures will be described in detail
in this chapter.


8.1 Motivation for Recurrent Neural Networks 


We start from a deep FN network providing the regression function, see (7.2)-(7.3),

$$x \mapsto \mu(x) = g^{-1} \left\langle \beta, z^{(d:1)}(x) \right\rangle, \tag{8.1}$$

with a composition $z^{(d:1)}$ of d FN layers $z^{(m)}$, 1 ≤ m ≤ d, link function g and
with output parameter $\beta \in \mathbb{R}^{q_d + 1}$. In principle, we could directly
use this FN network architecture for time-series forecasting. We explain here why this is
not the best option to deal with time-series data.

Assume we want to predict a random variable $Y_{T+1}$ at time T ≥ 0 based on the
time-series information $x_0, x_1, \ldots, x_T$. This information is assumed to be
available at time T for predicting the response $Y_{T+1}$. The past response information
$Y_t$, 1 ≤ t ≤ T, is typically included in $x_t$.¹ Using the above FN network architecture
we could directly try to predict $Y_{T+1}$, based on this past information. Therefore, we
define the feature information $x_{0:T} = (x_0, \ldots, x_T)$ and we aim at designing a FN
network (8.1) for modeling

$$x_{0:T} \mapsto \mu_T(x_{0:T}) = \mathbb{E}[Y_{T+1} \,|\, x_{0:T}] = \mathbb{E}[Y_{T+1} \,|\, x_0, \ldots, x_T].$$


In principle we could work with such an approach, however, it has a couple
of severe drawbacks. Obviously, the length of the feature vector $x_{0:T}$ depends
on time T, that is, it will grow with every time step. Therefore, the regression
function (network architecture) $x_{0:T} \mapsto \mu_T(x_{0:T})$ is time-dependent.
Consequently, with this approach we have to fit a network for every T. This deficiency
can be circumvented if we assume a Markov property that does not require carrying
forward the whole past history. Assume that it is sufficient to consider a history of
a certain length. Choose τ ≥ 0 fixed, then, for T ≥ τ, we can set for the feature
information $x_{T-\tau:T} = (x_{T-\tau}, \ldots, x_T)$, which now has a fixed length
τ + 1 ≥ 1. In this situation we could try to design a FN network

$$x_{T-\tau:T} \mapsto \mu(x_{T-\tau:T}) = \mathbb{E}[Y_{T+1} \,|\, x_{T-\tau:T}] = \mathbb{E}[Y_{T+1} \,|\, x_{T-\tau}, \ldots, x_T].$$

This network regression function can be chosen independent of T since the relevant
history $x_{T-\tau:T}$ always has the same length τ + 1. The time variable T could be used
as a feature component in $x_{T-\tau:T}$. The disadvantage of this approach is that such
a FN network architecture does not respect the temporal causality. Observe that we
feed the past history into the first FN layer

$$x_{T-\tau:T} \mapsto z^{(1)}(x_{T-\tau:T}) \in \{1\} \times \mathbb{R}^{q_1}.$$

This operation typically does not respect any topology in the time index of
$x_{T-\tau:T}$. Thus, the FN network does not recognize that the feature $x_{t-1}$ has been
experienced just before the next feature $x_t$. For this reason we are looking for a
network architecture that can handle the time-series information in a temporal causal
way.


¹ More mathematically speaking, we assume to have a filtration $(\mathcal{A}_t)_{t \ge 0}$ on the probability space
$(\Omega, \mathcal{A}, \mathbb{P})$. The basic assumption then is that both sequences $(x_t)_t$ and $(Y_t)_t$ are $(\mathcal{A}_t)_t$-adapted, and
we aim at predicting $Y_{T+1}$, based on the information $\mathcal{A}_T$. In the above case this information $\mathcal{A}_T$ is
generated by $x_0, x_1, \ldots, x_T$, where $x_t$ typically includes the observation $Y_t$. We could also shift
the time index in $x_t$ by one time unit, and in that case we would assume that $(x_t)_t$ is previsible
w.r.t. the filtration $(\mathcal{A}_t)_t$. We do not consider this shift in time index as it only makes the notation
unnecessarily more complicated, but the results remain the same by including the information
correspondingly into the features.

8.2.1 Recurrent Neural Network Layer


We explain the basic idea of RN networks in a shallow network architecture, and
deep network architectures will be discussed in Sect. 8.2.2, below. We start from the
time-series input variable $x_{0:T} = (x_0, \ldots, x_T)$, all components having the same
structure $x_t \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$, $0 \le t \le T$. The aim is to design a network
architecture that allows us to predict the random variable $Y_{T+1}$, based on this
time-series information $x_{0:T}$.

The main idea is to feed one component $x_t$ of the time-series $x_{0:T}$ at a time into
the network, and at the same time we use the output $z_{t-1}$ of the previous loop as
an input for the next loop. This variable $z_{t-1}$ carries forward a memory of the past
variables $x_{0:t-1}$. We explain this with a single RN layer having $q_1 \in \mathbb{N}$ neurons. A
RN layer is given (recursively) by a mapping, $t \ge 1$,

$$z^{(1)} : \{1\} \times \mathbb{R}^{q_0} \times \mathbb{R}^{q_1} \to \mathbb{R}^{q_1}, \qquad (x_t, z_{t-1}) \mapsto z_t = z^{(1)}(x_t, z_{t-1}), \tag{8.2}$$

where the RN layer $z^{(1)}$ has the same structure as the FN layer given in (7.5), but
based on feature input $(x_t, z_{t-1}) \in \mathcal{X} \times \mathbb{R}^{q_1} \subset \{1\} \times \mathbb{R}^{q_0} \times \mathbb{R}^{q_1}$, and not including
an intercept component $\{1\}$ in the output.

More formally, a RN layer with activation function $\phi$ is a mapping

$$z^{(1)} : \{1\} \times \mathbb{R}^{q_0} \times \mathbb{R}^{q_1} \to \mathbb{R}^{q_1}, \qquad (x, z) \mapsto z^{(1)}(x, z) = \left(z_1^{(1)}(x, z), \ldots, z_{q_1}^{(1)}(x, z)\right)^\top, \tag{8.3}$$

having neurons, $1 \le j \le q_1$,

$$z_j^{(1)}(x, z) = \phi\left(\langle w_j^{(1)}, x \rangle + \langle u_j^{(1)}, z \rangle\right), \tag{8.4}$$

for given network weights $w_j^{(1)} \in \mathbb{R}^{q_0+1}$ and $u_j^{(1)} \in \mathbb{R}^{q_1}$.

Thus, the FN layers (7.5)-(7.6) and the RN layers (8.3)-(8.4) are structurally
equivalent, only the input $x \in \mathcal{X}$ is adapted to the time-series structure
$(x_t, z_{t-1}) \in \mathcal{X} \times \mathbb{R}^{q_1}$. Before giving more interpretation and before explaining how this single
RN network structure can be extended to a deep RN network we illustrate this RN
layer.


[Fig. 8.1: RN layer $z^{(1)}$ processing the input $(x_t, z_{t-1})$]

[Fig. 8.2: Unfolded representation of RN layer $z^{(1)}$ processing the input $(x_t, z_{t-1})$]

Figure 8.1 shows an RN layer $z^{(1)}$ processing the input $(x_t, z_{t-1})$, see (8.2). From
this graph, the recurrent structure becomes clear since we have a loop (cycle) feeding
the output $z_t$ back into the RN layer to process the next input $(x_{t+1}, z_t)$.

Often one depicts the RN architecture in a so-called unfolded way. This is done
in Fig. 8.2. Instead of plotting the loop (cycle) as in Fig. 8.1 (orange arrow in the
colored version), we unfold this loop by plotting the RN layer multiple times. Note
that this RN layer in Fig. 8.2 always uses the same network weights $w_j^{(1)}$ and $u_j^{(1)}$,
$1 \le j \le q_1$, for all $t$. Moreover, the colors of the arrows (in the colored
version) coincide in the two figures.


Remarks 8.1

• The neurons of the RN layer (8.4) have the following structure

$$z_j^{(1)}(x, z) = \phi\left(\langle w_j^{(1)}, x \rangle + \langle u_j^{(1)}, z \rangle\right) = \phi\left(w_{0,j}^{(1)} + \sum_{l=1}^{q_0} w_{l,j}^{(1)} x_l + \sum_{l=1}^{q_1} u_{l,j}^{(1)} z_l\right).$$

  The network weights $W^{(1)} = (w_j^{(1)})_{1 \le j \le q_1} \in \mathbb{R}^{(q_0+1) \times q_1}$ include an intercept
  component $w_{0,j}^{(1)}$ and the network weights $U^{(1)} = (u_j^{(1)})_{1 \le j \le q_1} \in \mathbb{R}^{q_1 \times q_1}$ do not
  include an intercept component, otherwise we would have a redundancy.

• The RN network architecture generates a new process $(z_t)_t$. This process encodes
  the part of the past history $(x_{0:t})_t$ which is relevant for forecasting the next step.
  Thus, $(z_t)_t$ can be interpreted as a (latent) memory process, or as the process of
  learned (relevant) time-series representation giving us $z_t = z_t(x_{0:t})$.

• The same activation function $\phi$ and the same network weights $(w_j^{(1)})_{1 \le j \le q_1}$ and
  $(u_j^{(1)})_{1 \le j \le q_1}$ are shared across all time periods $t \ge 0$. This means that we assume
  a stationary (stochastic) process.

• The upper index $^{(1)}$ indicates the fact that this is the first (and single) RN layer
  in this example. In this sense, Figs. 8.1 and 8.2 show a shallow RN network. In
  the next section we are going to discuss deep RN networks, and below we are
  also going to discuss how the output is modeled, i.e., how the response $Y_{T+1}$ is
  predicted based on the pre-processed features $(z_t)_{0 \le t \le T} \in \mathbb{R}^{q_1 \times (T+1)}$.
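
The recursion (8.2)-(8.4) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `rn_layer` and `run_rn`, the toy weights and the zero initial memory are not from the text, and the weights are stored with one row per neuron (the transpose of the column-wise convention used for $W^{(1)}$ and $U^{(1)}$).

```python
import math

def rn_layer(x_t, z_prev, W, U, phi=math.tanh):
    """One step of (8.3)-(8.4): z_j = phi(<w_j, x_t> + <u_j, z_prev>).

    x_t carries the intercept component 1 as its first entry;
    W[j] has q0 + 1 entries, U[j] has q1 entries (no intercept).
    """
    return [
        phi(sum(w * x for w, x in zip(W[j], x_t))
            + sum(u * z for u, z in zip(U[j], z_prev)))
        for j in range(len(W))
    ]

def run_rn(xs, W, U, phi=math.tanh):
    """Feed x_0, ..., x_T one at a time and carry the memory forward."""
    z = [0.0] * len(W)  # assumption: the memory starts at zero
    path = []
    for x_t in xs:
        z = rn_layer(x_t, z, W, U, phi)  # the same W and U are used for all t
        path.append(z)
    return path

# Toy example (made-up numbers): q0 = 1 feature plus intercept, q1 = 2 neurons.
W = [[0.0, 1.0], [0.5, -1.0]]
U = [[0.5, 0.0], [0.0, 0.5]]
xs = [[1.0, 0.2], [1.0, -0.1], [1.0, 0.4]]
memory = run_rn(xs, W, U)
```

Each entry of `memory` is one $z_t$, so the list as a whole plays the role of the memory process $(z_t)_{0 \le t \le T}$; the weight sharing across time is visible in the fact that `W` and `U` never change inside the loop.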


8.2.2 Deep Recurrent Neural Network Architectures 


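There are many different ways of extending a shallow RN network to a deep RN
network. Assume we want to model a RN network of depth $d \ge 2$. A first (obvious)
way of obtaining a deep RN network architecture is

$$z_t^{[1]} = z^{(1)}\left(x_t, z_{t-1}^{[1]}\right) \in \mathbb{R}^{q_1}, \tag{8.5}$$

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) \in \mathbb{R}^{q_m} \qquad \text{for } 2 \le m \le d, \tag{8.6}$$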


where all RN layers $z^{(m)}$, $1 \le m \le d$, are of type (8.3)-(8.4), and additionally we
include an intercept component in the RN layers $z^{(m)}$, $2 \le m \le d$. We add the
upper indices (in square brackets $[\cdot]$) to the time-series $(z_t^{[m]})_t$ to indicate which
RN layer outputs these learned representations (memory processes). In fact, we
could also write $z_t^{[m:1]}$ instead of $z_t^{[m]}$, because in $z_t^{[m:1]}$ the feature input $x_{0:t}$ has
been processed through $m$ RN layers $z^{(1)}, \ldots, z^{(m)}$. For simplicity, we just use the
notation $z_t^{[m]} = z_t^{[m]}(x_{0:t})$.

We are going to use the following abbreviation for a RN layer $m \ge 1$

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi\left(\langle W^{(m)}, z_t^{[m-1]} \rangle + \langle U^{(m)}, z_{t-1}^{[m]} \rangle\right) \in \mathbb{R}^{q_m}, \tag{8.7}$$

where the weights $W^{(m)} = (w_1^{(m)}, \ldots, w_{q_m}^{(m)}) \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ include the
intercept components, and the weights $U^{(m)} = (u_1^{(m)}, \ldots, u_{q_m}^{(m)}) \in \mathbb{R}^{q_m \times q_m}$
do not include any intercept components. The scalar product is understood
column-wise in the weight matrices $W^{(m)}$ and $U^{(m)}$, and the activation $\phi$ is
understood component-wise. Moreover, we initialize for the input $z_t^{[0]} = x_t$.

[Fig. 8.3: Unfolded representation of a RN network architecture of depth $d = 2$]

Figure 8.3 shows the RN network architecture of depth $d = 2$ defined in (8.5)-(8.6).
The dimension of the input $z_t^{[0]} = x_t \in \mathcal{X} \subset \{1\} \times \mathbb{R}^{q_0}$ is $q_0 + 1$, the first RN layer
has $q_1$ neurons and the second RN layer $q_2$ neurons. From this graph it becomes
clear how a RN network architecture of any depth $d \in \mathbb{N}$ can be constructed
(recursively).


Remark 8.2 There are many alternative ways of building deep RN networks. E.g.,
we can add a loop that connects the output of the second RN layer back to the first
one

$$z_t^{[1]} = z^{(1)}\left(x_t, z_{t-1}^{[1]}, z_{t-1}^{[2]}\right),$$

$$z_t^{[2]} = z^{(2)}\left(z_t^{[1]}, z_{t-1}^{[2]}\right),$$

or we can add a skip connection from the input variable $x_t$ to the second RN layer

$$z_t^{[1]} = z^{(1)}\left(x_t, z_{t-1}^{[1]}\right),$$

$$z_t^{[2]} = z^{(2)}\left(x_t, z_t^{[1]}, z_{t-1}^{[2]}\right).$$

We refrain from explicitly studying such RN network variants any further.


8.2.3 Designing the Network Output 


It remains to explain how to predict the response variable $Y_{T+1}$ based on
the pre-processed features (memory processes) $z_t^{[1]}, \ldots, z_t^{[d]}$, outputted by the RN
network of depth $d \ge 1$. Typically, only the final output of the last RN layer
$z_T^{[d]} = z_T^{[d]}(x_{0:T}) \in \mathbb{R}^{q_d}$ is considered to predict the response $Y_{T+1}$. We take this
output and feed it into a FN network $\bar{z}^{(D:1)} : \{1\} \times \mathbb{R}^{q_d} \to \{1\} \times \mathbb{R}^{\bar{q}_D}$ of depth
$D \in \mathbb{N}$ and with FN layers $\bar{z}^{(m)}$, $1 \le m \le D$, given by (7.5). Moreover, we choose
a strictly monotone and smooth link function $g$.

This then provides us with the regression function, see (7.7)-(7.8),

$$x_{0:T} \mapsto \mathbb{E}[Y_{T+1} \mid x_{0:T}] = \mu(x_{0:T}) = g^{-1}\left\langle \beta, \bar{z}^{(D:1)}\left(z_T^{[d]}(x_{0:T})\right) \right\rangle. \tag{8.8}$$

Thus, we first process the time-series features $x_{0:T}$ through a RN network
to receive the learned representation $z_T^{[d]}(x_{0:T}) \in \mathbb{R}^{q_d}$ at time $T$. This learned
representation is then used as a feature input to a FN network $\bar{z}^{(D:1)}$ that allows
us to predict the response $Y_{T+1}$. This is illustrated in Fig. 8.4 for depth $d = 1$.


Remarks 8.3

• From the graph in Fig. 8.4 it also becomes apparent that we can consider different
  insurance policies $1 \le i \le n$ having different lengths of the corresponding
  histories $x_{i,T-\tau_i:T} \in \mathbb{R}^{(q_0+1) \times (\tau_i+1)}$, $\tau_i \in \{0, \ldots, T\}$. The stationarity assumption
  allows us to enter the network in Fig. 8.4 at any time $T - \tau_i$. The RN network
  encodes this history into a learned feature $z_T^{[d]}(x_{i,T-\tau_i:T})$ which is then decoded
  by the FN network $\bar{z}^{(D:1)}$ to forecast $Y_{i,T+1}$.

• If there is additional insurance policy dependent feature information $\tilde{x}_i$ that
  is not of a time-series structure, we can concatenate the feature information
  $(z_T^{[d]}(x_{i,0:T}), \tilde{x}_i)$ which then enters the FN network (8.8).

[Fig. 8.4: Forecasting the response $Y_{T+1}$ using a RN network (8.8) based on a single RN layer
$d = 1$ and on a FN network of depth $D$]


It remains to fit this network architecture having $d$ RN layers and $D$ FN
layers to the available data. The RN layers involve the network weights
$W^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U^{(m)} \in \mathbb{R}^{q_m \times q_m}$, for $1 \le m \le d$, and the FN layers involve the
network weights $(\bar{w}_j^{(m)})_{1 \le j \le \bar{q}_m} \in \mathbb{R}^{(\bar{q}_{m-1}+1) \times \bar{q}_m}$, for $1 \le m \le D$, and with $\bar{q}_0 = q_d$.
Moreover, we have an output parameter $\beta \in \mathbb{R}^{\bar{q}_D+1}$. The fitting is again done by a
gradient descent algorithm minimizing the corresponding objective function.

Assume we have independent (in $i$) data $(Y_{i,T+1}, x_{i,0:T}, v_{i,T+1})$ of the cases
$1 \le i \le n$. We then assume that the responses $Y_{i,T+1}$ can be modeled by a fixed member
of the EDF having unit deviance $\mathfrak{d}$. We consider the deviance loss function, see (4.9),

$$\vartheta \mapsto \mathfrak{D}(Y_{T+1}, \vartheta) = \frac{1}{n} \sum_{i=1}^n \frac{v_{i,T+1}}{\varphi}\, \mathfrak{d}\left(Y_{i,T+1}, \mu_\vartheta(x_{i,0:T})\right), \tag{8.9}$$

for the observations $Y_{T+1} = (Y_{1,T+1}, \ldots, Y_{n,T+1})^\top$, and where $\vartheta$ collects all the
RN and FN network weights/parameters of the regression function (8.8). This model
can now be fitted using a variant of the gradient descent algorithm. The variant
uses back-propagation through time (BPTT), which is an adaptation of the
back-propagation method to calculate the gradient w.r.t. the network parameter $\vartheta$.


8.2.4 Time-Distributed Layer 


There is a special feature in RN network modeling which is called a time-distributed
layer. Observe from Fig. 8.4 that the deviance loss function (8.9) only focuses on the
final observation $Y_{i,T+1}$. However, the stationarity assumption allows us to output
and study any (previous) observation $Y_{i,t+1}$, $0 \le t \le T$. A time-distributed layer
considers applying the deep FN network (8.8) simultaneously at all time points
$0 \le t \le T$; simultaneously meaning that we use the same FN network weights for all $t$.
The latter is justified under the assumption of having stationarity.

This then provides us with the regressions

$$x_{0:t} \mapsto \mathbb{E}[Y_{t+1} \mid x_{0:t}] = \mu(x_{0:t}) = g^{-1}\left\langle \beta, \bar{z}^{(D:1)}\left(z_t^{[d]}(x_{0:t})\right) \right\rangle \qquad \text{for all } t \ge 0. \tag{8.10}$$

Figure 8.5 illustrates a time-distributed output where we predict $(Y_{t+1})_t$ based on
the history $(x_{0:t})_t$, and we always apply the same FN network $\bar{z}^{(D:1)}$ to the memory
$z_t^{[1]} = z_t^{[1]}(x_{0:t})$.

A time-distributed layer changes the fitting procedure. Instead of considering
the objective function (8.9) for the final observation $Y_{i,T+1}$, we now include all
observations $Y = (Y_{i,t+1})_{0 \le t \le T,\, 1 \le i \le n}$ into the objective function. This results in
studying the deviance loss function

$$\vartheta \mapsto \mathfrak{D}(Y, \vartheta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{T+1} \sum_{t=0}^T \frac{v_{i,t+1}}{\varphi}\, \mathfrak{d}\left(Y_{i,t+1}, \mu_\vartheta(x_{i,0:t})\right). \tag{8.11}$$


[Fig. 8.5: Forecasting $(Y_{t+1})_t$ using a RN network (8.10) based on a single RN layer $d = 1$ and
using a time-distributed FN layer for the outputs]

Note that this can easily be adapted if the different cases $1 \le i \le n$ have different
lengths in their histories. An example is provided in Listing 10.8, below.
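
To make the averaging in (8.11) concrete, the following plain-Python sketch evaluates a time-distributed loss for cases with histories of different lengths. It rests on assumptions that are not in the text: the weights $v_{i,t+1}/\varphi$ are set to 1, the Gaussian (squared error) unit deviance is used, and each case is averaged over its own number of time points, which is one possible way to adapt (8.11) to unequal lengths. The function names are hypothetical.

```python
def gaussian_unit_deviance(y, mu):
    """Unit deviance of the Gaussian model: the squared error."""
    return (y - mu) ** 2

def time_distributed_loss(Y, Mu, unit_deviance=gaussian_unit_deviance):
    """Average unit deviance over all time points of all cases.

    Y[i] and Mu[i] hold the observations and fitted means of case i;
    the inner lists may have different lengths across cases.
    Assumption: weights v / phi are all equal to 1.
    """
    total = 0.0
    for y_i, mu_i in zip(Y, Mu):
        # inner average over the time points of case i
        total += sum(unit_deviance(y, m) for y, m in zip(y_i, mu_i)) / len(y_i)
    return total / len(Y)  # outer average over the cases

# Made-up toy data: two cases with 3 and 2 time points.
Y = [[1.0, 2.0, 3.0], [0.0, 1.0]]
Mu = [[1.0, 2.0, 2.0], [1.0, 1.0]]
loss = time_distributed_loss(Y, Mu)
```

Restricting every inner list to its last element recovers the structure of (8.9), where only the final observation of each case enters the loss.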


8.3 Special Recurrent Neural Networks 


In the plain-vanilla RN networks introduced above we have defined the memory
processes $(z_t^{[m]})_{t \ge 0}$, $1 \le m \le d$, which encode the information history $(x_t)_{t \ge 0}$
through different RN layers in a temporal causal way. This is naturally done through
the use of a time-series structure as illustrated, e.g., in Fig. 8.5. There are more
specific RN network architectures that allow the memory processes to be of a long
memory or a short memory type. In this section, we present the two most popular
architectures that pay special attention to the memory storage. These are the long
short-term memory (LSTM) architecture introduced by Hochreiter-Schmidhuber
[188] and the gated recurrent unit (GRU) architecture proposed by Cho et al. [76].


8.3.1 Long Short-Term Memory Network 


The LSTM network of Hochreiter-Schmidhuber [188] is the most commonly used
RN network architecture. The LSTM network uses simultaneously three different
activation functions for different purposes, the sigmoid and hyperbolic tangent
activation functions, respectively,

$$\phi_\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1) \qquad \text{and} \qquad \phi_{\tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \in (-1, 1),$$

and a general activation function $\phi : \mathbb{R} \to \mathbb{R}$, see also Table 7.1.

The LSTM network relies on several RN layers that are of the same structure
as the plain-vanilla RN layer given in (8.7). We start by defining three different
so-called gates that all have the RN layer structure (8.7). These three gates are used
to model the memory cell of the LSTM network. Choose a layer index $m \ge 1$ and
assume that $z_t^{[m-1]}$ is modeled by the previous layer $m - 1$; for $m = 1$ we initialize
$z_t^{[0]} = x_t$. The three gates are then defined as follows, set $t \ge 1$:


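• The forget gate models the loss of memory rate

$$f_t^{[m]} = f^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_f\left(\langle W_f^{(m)}, z_t^{[m-1]} \rangle + \langle U_f^{(m)}, z_{t-1}^{[m]} \rangle\right) \in (0, 1)^{q_m},$$

  with the network weights $W_f^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_f^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
  the sigmoid activation function $\phi_f = \phi_\sigma$, we also refer to (8.7).

• The input gate models the memory update rate

$$i_t^{[m]} = i^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_i\left(\langle W_i^{(m)}, z_t^{[m-1]} \rangle + \langle U_i^{(m)}, z_{t-1}^{[m]} \rangle\right) \in (0, 1)^{q_m},$$

  with the network weights $W_i^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_i^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
  the sigmoid activation function $\phi_i = \phi_\sigma$.

• The output gate models the release of memory information rate

$$o_t^{[m]} = o^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_o\left(\langle W_o^{(m)}, z_t^{[m-1]} \rangle + \langle U_o^{(m)}, z_{t-1}^{[m]} \rangle\right) \in (0, 1)^{q_m}, \tag{8.12}$$

  with the network weights $W_o^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_o^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
  the sigmoid activation function $\phi_o = \phi_\sigma$.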


These gates have outputs in $(0, 1)$, and they determine the relative amount of
memory that is updated and released in each step. The so-called cell state process
$(c_t^{[m]})_t$ is used to store the relevant memory. Given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $c_{t-1}^{[m]}$, the
updated cell state is defined by

$$c_t^{[m]} = c^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]}\right) = f_t^{[m]} \odot c_{t-1}^{[m]} + i_t^{[m]} \odot \phi_{\tanh}\left(\langle W_c^{(m)}, z_t^{[m-1]} \rangle + \langle U_c^{(m)}, z_{t-1}^{[m]} \rangle\right) \in \mathbb{R}^{q_m}, \tag{8.13}$$

with the network weights $W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_c^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and $\odot$
denotes the Hadamard product. This defines how the memory (cell state) is updated
and passed forward using the forget and the input gates $f_t^{[m]}$ and $i_t^{[m]}$, respectively.

The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$, $z_{t-1}^{[m]}$ and $c_t^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_t^{[m]}\right) = o_t^{[m]} \odot \phi\left(c_t^{[m]}\right) \in \mathbb{R}^{q_m}, \tag{8.14}$$

with the cell state $c_t^{[m]}$ given in (8.13) and the output gate $o_t^{[m]}$ defined in (8.12).
Figure 8.6² shows a LSTM cell (8.13)-(8.14) which includes four RN layers (8.7)
for the forget gate $f^{(m)}$, the input gate $i^{(m)}$, the output gate $o^{(m)}$ and in the cell
state update (8.13). These RN layers are combined using the Hadamard product $\odot$
resulting in the updated cell state $c_t^{[m]}$ and the learned representation $z_t^{[m]}$, both being
functions of the inputs $x_{0:t}$.


² This figure is based on colah's blog explaining LSTMs https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

[Fig. 8.6: LSTM cell $z^{(m)}$ with forget gate $\phi_f$, input gate $\phi_i$ and output gate $\phi_o$]


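Below, we are going to summarize the LSTM cell update (8.13)-(8.14) as
follows

$$\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]}\right) \mapsto \left(z_t^{[m]}, c_t^{[m]}\right) = z^{\mathrm{LSTM}(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}, c_{t-1}^{[m]}\right). \tag{8.15}$$

The update (8.15) involves the eight network weight matrices $W_f^{(m)}, W_i^{(m)}, W_o^{(m)},
W_c^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_f^{(m)}, U_i^{(m)}, U_o^{(m)}, U_c^{(m)} \in \mathbb{R}^{q_m \times q_m}$. Altogether we have
$4(q_{m-1} + 1 + q_m) q_m$ network parameters in each LSTM cell $1 \le m \le d$. These
are learned with the gradient descent method. Moreover, we need to initialize the
LSTM cell update (8.15). From the previous layer $m - 1$ we have the input $z_t^{[m-1]}$
which we initialize as $z_t^{[0]} = x_t$ for $m = 1$ and $t \ge 0$. The initial states $z_0^{[m]}$ and
$c_0^{[m]}$ are usually set to zero.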
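
The LSTM cell update (8.12)-(8.15) can be traced step by step with a small plain-Python sketch. The names `lstm_step`, `affine` and `lstm_param_count`, the dictionary layout of the weights and the toy numbers are assumptions made for illustration only; weights are stored with one row per neuron, and the input `x` carries the intercept component 1 as its first entry.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, x, z):
    """<W, x> + <U, z>, evaluated neuron by neuron (one row of W and U each)."""
    return [sum(w * a for w, a in zip(W[j], x))
            + sum(u * b for u, b in zip(U[j], z))
            for j in range(len(W))]

def lstm_step(x, z_prev, c_prev, weights, phi=math.tanh):
    """One update (8.15): (x, z_prev, c_prev) -> (z, c)."""
    f = [sigmoid(a) for a in affine(*weights["f"], x, z_prev)]    # forget gate
    i = [sigmoid(a) for a in affine(*weights["i"], x, z_prev)]    # input gate
    o = [sigmoid(a) for a in affine(*weights["o"], x, z_prev)]    # output gate (8.12)
    g = [math.tanh(a) for a in affine(*weights["c"], x, z_prev)]  # candidate memory
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]  # (8.13)
    z = [oj * phi(cj) for oj, cj in zip(o, c)]                          # (8.14)
    return z, c

def lstm_param_count(q_in, q_out):
    """Four pairs (W, U) with (q_in + 1) * q_out and q_out * q_out entries."""
    return 4 * (q_in + 1 + q_out) * q_out

# Toy example with all weights equal to zero, so every gate equals 1/2.
zero_W = [[0.0, 0.0], [0.0, 0.0]]   # q_in + 1 = 2 inputs, q_out = 2 neurons
zero_U = [[0.0, 0.0], [0.0, 0.0]]
weights = {name: (zero_W, zero_U) for name in ("f", "i", "o", "c")}
z, c = lstm_step([1.0, 0.3], [0.0, 0.0], [1.0, -2.0], weights)
```

With zero weights the forget gate halves the previous cell state and the candidate memory vanishes, which gives a quick sanity check of the Hadamard products in (8.13)-(8.14).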


8.3.2 Gated Recurrent Unit Network 


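The LSTM architecture of the previous section seems quite complex and involves
many parameters. Cho et al. [76] have introduced the GRU architecture that is
simpler and uses fewer parameters, but has similar properties. The GRU architecture
uses two gates that are defined as follows for $t \ge 1$, see also (8.7):

• The reset gate models the memory reset rate

$$r_t^{[m]} = r^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_r\left(\langle W_r^{(m)}, z_t^{[m-1]} \rangle + \langle U_r^{(m)}, z_{t-1}^{[m]} \rangle\right) \in (0, 1)^{q_m},$$

  with the network weights $W_r^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_r^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
  the sigmoid activation function $\phi_r = \phi_\sigma$.

• The update gate models the memory update rate

$$u_t^{[m]} = u^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = \phi_u\left(\langle W_u^{(m)}, z_t^{[m-1]} \rangle + \langle U_u^{(m)}, z_{t-1}^{[m]} \rangle\right) \in (0, 1)^{q_m},$$

  with the network weights $W_u^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U_u^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and with
  the sigmoid activation function $\phi_u = \phi_\sigma$.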

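The neuron activations $z_t^{[m]}$ are updated, given $z_t^{[m-1]}$ and $z_{t-1}^{[m]}$, by

$$z_t^{[m]} = z^{(m)}\left(z_t^{[m-1]}, z_{t-1}^{[m]}\right) = r_t^{[m]} \odot z_{t-1}^{[m]} + \left(1 - r_t^{[m]}\right) \odot \phi\left(\langle W^{(m)}, z_t^{[m-1]} \rangle + u_t^{[m]} \odot \langle U^{(m)}, z_{t-1}^{[m]} \rangle\right) \in \mathbb{R}^{q_m}, \tag{8.16}$$

with the network weights $W^{(m)} \in \mathbb{R}^{(q_{m-1}+1) \times q_m}$ and $U^{(m)} \in \mathbb{R}^{q_m \times q_m}$, and for a
general activation function $\phi$.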

The GRU and the LSTM architectures are similar, the former using fewer parameters
because we do not explicitly model the cell state process. For an illustration of a
GRU cell we refer to Fig. 8.7. In the sequel we focus on the LSTM architecture;
though the GRU architecture is simpler and has fewer parameters, it is less robust in
fitting.

[Fig. 8.7: GRU cell $z^{(m)}$ with reset gate $\phi_r$ and update gate $\phi_u$]
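
For comparison with the LSTM cell, the GRU update (8.16) can be sketched in plain Python as follows. The roles of the reset and update gates are taken exactly as written in (8.16); the names `gru_step`, `lin` and `gru_param_count`, the dictionary keys and the toy numbers are illustrative assumptions. The parameter count $3(q_{m-1} + 1 + q_m) q_m$ is not stated in the text; it simply counts the three pairs of weight matrices listed above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lin(M, v):
    """Row-wise scalar products <M_j, v>, one per neuron j."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def gru_step(x, z_prev, p, phi=math.tanh):
    """One GRU update following (8.16); x carries the intercept 1 first."""
    r = [sigmoid(a + b) for a, b in zip(lin(p["W_r"], x), lin(p["U_r"], z_prev))]
    u = [sigmoid(a + b) for a, b in zip(lin(p["W_u"], x), lin(p["U_u"], z_prev))]
    # candidate activation: phi(<W, x> + u * <U, z_prev>), component-wise
    cand = [phi(a + uj * b)
            for a, uj, b in zip(lin(p["W"], x), u, lin(p["U"], z_prev))]
    # mix of the old memory and the candidate, weighted by the reset gate
    return [rj * zj + (1.0 - rj) * cj for rj, zj, cj in zip(r, z_prev, cand)]

def gru_param_count(q_in, q_out):
    """Three pairs (W, U): reset gate, update gate and candidate activation."""
    return 3 * (q_in + 1 + q_out) * q_out

# Toy example with all weights equal to zero, so both gates equal 1/2.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {key: zeros for key in ("W_r", "U_r", "W_u", "U_u", "W", "U")}
z_new = gru_step([1.0, 0.3], [1.0, -2.0], params)
```

For the dimensions used in the lab below ($q_0 = 1$, $q_1 = 15$), this count gives 765 parameters against 1'020 for the LSTM cell.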


8.4 Lab: Mortality Forecasting with RN Networks 


8.4.1 Lee-Carter Model, Revisited


The mortality data has a natural time-series structure, and for this reason mortality
forecasting is an obvious problem that can be studied within RN networks. For
instance, the LC mortality model (7.63) involves a stochastic process $(k_t)_t$ that
needs to be extrapolated into the future. This extrapolation problem can be solved
in different ways. The original proposal of Lee and Carter [238] has been to analyze
ARIMA time-series models, and to use standard statistical tools. Lee and Carter
found that the random walk with drift gives a good stochastic description of the
time index process $(k_t)_t$. Nigri et al. [286] proposed to fit a LSTM network to
this stochastic process; this approach is also studied in Lindholm-Palmborg [252]
where an efficient use of the mortality data for network fitting is discussed. These
approaches still rely on the classical LC calibration using the SVD of Sect. 7.5.4,
and the LSTM network is (only) used to extrapolate the LC time index process $(k_t)_t$.

More generally, one can design a RN network architecture that directly processes
the raw mortality data $M_{x,t} = D_{x,t}/e_{x,t}$, not specifically relying on the LC structure.
This has been done in Richman-Wüthrich [316] using a FN network architecture, in
Perla et al. [301] using a RN network and a convolutional neural (CN) network
architecture, and in Schürch-Korn [330] extending this analysis to the study of
prediction uncertainty using bootstrapping. A similar CN network approach has
been taken by Wang et al. [375] interpreting the raw mortality data of Fig. 7.32
as an image.


Lee-Carter Mortality Model: Random Walk with Drift Extrapolation

We revisit the LC mortality model [238] presented in Sect. 7.5.4. The LC
log-mortality rate is assumed to have the following structure, see (7.63),

$$\log\left(\mu_{x,t}^{(p)}\right) = a_x^{(p)} + b_x^{(p)} k_t^{(p)},$$

for the ages $x_0 \le x \le x_1$ and for the calendar years $t \in \mathcal{T}$. We now add the upper
indices $^{(p)}$ to consider different populations $p$. The SVD gives us the estimates $\hat{a}_x^{(p)}$,
$\hat{k}_t^{(p)}$ and $\hat{b}_x^{(p)}$ based on the observed centered raw log-mortality rates, see Sect. 7.5.4.
The SVD is applied to each population $p$ separately, i.e., there is no interaction
between the different populations. This approach allows us to fit a separate
log-mortality surface estimate $(\log(\hat{\mu}_{x,t}^{(p)}))_{x_0 \le x \le x_1;\, t \in \mathcal{T}}$ to each population $p$. Figure 7.33
shows an example for two populations $p$, namely, for Swiss females and for Swiss
males.

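The mortality forecasting requires extrapolating the time index processes
$(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ beyond the latest observed calendar year $t_1 = \max\{\mathcal{T}\}$. As mentioned in
Lee-Carter [238] a random walk with drift provides a suitable model for modeling
$(\hat{k}_t^{(p)})_{t \ge 0}$ for many populations $p$, see Fig. 7.35 for the Swiss population. Assume
that

$$\hat{k}_{t+1}^{(p)} = \hat{k}_t^{(p)} + \varepsilon_{t+1}^{(p)}, \qquad t \ge 0, \tag{8.17}$$

with $\varepsilon_t^{(p)} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\delta_p, \sigma_p^2)$, $t \ge 1$, having drift $\delta_p \in \mathbb{R}$ and variance $\sigma_p^2 > 0$.
Model assumption (8.17) allows us to estimate the (constant) drift $\delta_p$ with MLE.
For observations $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ we receive the log-likelihood function

$$\delta_p \mapsto \ell_{(\hat{k}_t^{(p)})_{t \in \mathcal{T}}}(\delta_p) = \sum_{t=t_0+1}^{t_1} \left[ -\log\left(\sqrt{2\pi}\,\sigma_p\right) - \frac{1}{2\sigma_p^2} \left(\hat{k}_t^{(p)} - \hat{k}_{t-1}^{(p)} - \delta_p\right)^2 \right],$$

with first observed calendar year $t_0 = \min\{\mathcal{T}\}$. The MLE is given by

$$\hat{\delta}_p^{\mathrm{MLE}} = \frac{\hat{k}_{t_1}^{(p)} - \hat{k}_{t_0}^{(p)}}{t_1 - t_0}. \tag{8.18}$$

This allows us to forecast the time index process for $t > t_1$ by

$$\hat{k}_t^{(p)} = \hat{k}_{t_1}^{(p)} + (t - t_1)\, \hat{\delta}_p^{\mathrm{MLE}}.$$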

We explore this extrapolation for different Western European countries from the
HMD [195]. We consider separately females and males of the countries {AUT, BE,
CH, ESP, FRA, ITA, NL, POR}, thus, we choose $2 \cdot 8 = 16$ different populations
$p$. For these countries we have observations for the ages $0 = x_0 \le x \le x_1 = 99$
and for the calendar years $1950 \le t \le 2018$.³ For the following analysis we choose
$\mathcal{T} = \{t_0 \le t \le t_1\} = \{1950 \le t \le 2003\}$, thus, we fit the models on 54 years
of mortality history. These fitted models are then extrapolated to the calendar years
$2004 \le t \le 2018$. These 15 calendar years from 2004 to 2018 allow us to perform
an out-of-sample evaluation because we have the observations $M_{x,t}^{(p)} = D_{x,t}^{(p)}/e_{x,t}^{(p)}$
for these years from the HMD [195].
Figure 8.8 shows the estimated time index process $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ to the left of the
dotted lines, and to the right of the dotted lines we have the random walk with
drift extrapolation $(\hat{k}_t^{(p)})_{t > t_1}$. The general observation is that, indeed, the random
walk with drift seems to be a suitable model for $(\hat{k}_t^{(p)})_t$. Moreover, there is a huge
similarity between the different countries, only with the Netherlands (NL) being
somewhat an outlier.

³ We exclude Germany from this consideration of (continental) Western European countries
because the German mortality history is shorter due to the reunification in 1990.

[Fig. 8.8: Random walk with drift extrapolation of the time index process $(\hat{k}_t^{(p)})_t$ for different
countries and genders; the y-scale is the same in both plots]


Remarks 8.4

• For Fig. 8.8 we did not explore any fine-tuning, for instance, the estimation of
  the drift $\delta_p$ is very sensitive to the selection of the time span $\mathcal{T}$. ESP has the
  biggest negative drift estimate, but this is partially caused by the corresponding
  observations in the calendar years between 1950 and 1960, see Fig. 8.8, which
  may no longer be relevant for a decline in mortality in the new millennium.

• For all countries, the females have a bigger negative drift than the males (the
  y-scale in both plots is the same). Moreover, note that we use the normalization
  $\sum_{x=x_0}^{x_1} \hat{b}_x^{(p)} = 1$ and $\sum_{t \in \mathcal{T}} \hat{k}_t^{(p)} = 0$, see (7.65). This normalization is discussed
  and questioned in many publications as the extrapolation becomes dependent on
  these choices; see De Jong et al. [90] and the references therein, who propose
  different identification schemes.

• Another issue is age coherence in forecasting, meaning that for long term
  forecasts the mortality rates across the different ages should not diverge, see Li
  et al. [250], Li-Lu [248] and Gao-Shi [153] and the references therein.

• There are many modifications and extensions of the LC model, we just mention
  a few of them. Brouhns et al. [56] embed the LC model into a Poisson modeling
  framework which provides a proper stochastic model for mortality modeling.
  Renshaw-Haberman [308] extend the one-factor LC model to a multifactor
  model, and in Renshaw-Haberman [309] a cohort effect is added.
  Hyndman-Ullah [197] and Hainaut-Denuit [179] explore a functional data method and a
  wavelet-based decomposition, respectively. The static PCA can be adapted to
  a dynamic PCA version, see Shang [333], and a long memory behavior in the
  time-series is studied in Yan et al. [395].

• The LC model is fitted to each population $p$ separately, without exploring
  any common structure across the populations. There are many multi-population
  extensions that try to learn common structure across different populations. We
  mention the common age effect (CAE) model of Kleinow [218], the augmented
  common factor (ACF) model of Li-Lee [249] and the functional time-series
  models of Hyndman et al. [196] and Shang-Haberman [334]. A direct
  multi-population extension of the SVD matrix decomposition of the LC model is
  obtained by the tensor decomposition approaches of Russolillo et al. [325] and
  Dong et al. [110].


Lee-Carter Mortality Model: LSTM Extrapolation


Our aim here is to replace the individual random walk with drift
extrapolations (8.17) by a common extrapolation across all considered populations $p$. For
this we design a LSTM architecture. A second observation is that the increments
$\varepsilon_t^{(p)} = \hat{k}_t^{(p)} - \hat{k}_{t-1}^{(p)}$ have an average empirical auto-correlation (for lag 1) of $-0.33$.
This clearly questions the Gaussian i.i.d. assumption in (8.17).
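
As an illustration of the quantity quoted above, a lag-1 empirical auto-correlation can be computed as in the following plain-Python sketch. The text does not specify which estimator convention was used, so the common variant below (overall mean, sum of squares in the denominator) is an assumption, as are the function name and the toy series.

```python
def lag1_autocorr(x):
    """Lag-1 empirical auto-correlation with a common overall mean.

    Assumption: numerator sums (x_t - m)(x_{t-1} - m) over t, the
    denominator is the full sum of squared deviations from the mean m.
    """
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# Made-up zig-zag series: consecutive increments have opposite signs.
rho = lag1_autocorr([1.0, -1.0, 1.0, -1.0])
```

A strongly alternating series such as this toy example yields a clearly negative value; under the i.i.d. assumption in (8.17) one would expect values close to zero.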

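We first discuss the available data and we construct the input data. We have
the time-series observations $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$, and the population index $p = (c, g)$ has
two categorical labels $c$ for country and $g$ for gender. We are going to use
two-dimensional embedding layers for these two categorical variables, see (7.31) for
embedding layers. The time-series observations $(\hat{k}_t^{(p)})_{t \in \mathcal{T}}$ will be pre-processed
such that we do not simultaneously feed the entire time-series into the LSTM layer,
but we divide them into shorter time-series. We will directly forecast the increments
$\varepsilon_t^{(p)} = \hat{k}_t^{(p)} - \hat{k}_{t-1}^{(p)}$ and not the time index process $(\hat{k}_t^{(p)})_t$; in extrapolations with
drift it is easier to forecast the increments with the networks. We choose a lookback
period of $\tau = 3$ calendar years, and we aim at predicting the response $Y_t = \varepsilon_t^{(p)}$
based on the time-series features $x_{t-\tau:t-1} = (\varepsilon_{t-\tau}^{(p)}, \ldots, \varepsilon_{t-1}^{(p)})^\top \in \mathbb{R}^\tau$. This provides
us with the following data structure for each population $p = (c, g)$:

| year             | country | gender | feature $x_{t-\tau:t-1}$                                          | $Y_t$                           |
|------------------|---------|--------|-------------------------------------------------------------------|---------------------------------|
| $t_0 + \tau + 1$ | $c$     | $g$    | $\varepsilon_{t_0+1}^{(p)}, \ldots, \varepsilon_{t_0+\tau}^{(p)}$ | $\varepsilon_{t_0+\tau+1}^{(p)}$ |
| $\vdots$         |         |        | $\vdots$                                                          | $\vdots$                        |
| $t$              | $c$     | $g$    | $\varepsilon_{t-\tau}^{(p)}, \ldots, \varepsilon_{t-1}^{(p)}$     | $\varepsilon_t^{(p)}$           |
| $\vdots$         |         |        | $\vdots$                                                          | $\vdots$                        |
| $t_1$            | $c$     | $g$    | $\varepsilon_{t_1-\tau}^{(p)}, \ldots, \varepsilon_{t_1-1}^{(p)}$ | $\varepsilon_{t_1}^{(p)}$       |

(8.19)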
Thus, each observation $Y_t = \varepsilon_t^{(p)}$ is equipped with the feature information
$(t, c, g, x_{t-\tau:t-1})$. As discussed in Lindholm-Palmborg [252], one should highlight
that there is a dependence across $t$, since we have a diagonal cohort structure in the
features and the observations $(x_{t-\tau:t-1}, Y_t)$. Usually, this dependence is not harmful
in stochastic gradient descent fitting.
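
The construction of the rows in (8.19) for one population can be sketched as follows in plain Python. The function name `lookback_rows` and the synthetic time index values are assumptions made for illustration; real values would come from the SVD fit of Sect. 7.5.4.

```python
def lookback_rows(k, t0, tau):
    """Build the rows (year, features, response) of (8.19) for one population.

    k holds the time index values for the calendar years t0, t0 + 1, ...;
    the features are the tau increments preceding the response year.
    """
    # increments eps_t = k_t - k_{t-1}, indexed by calendar year t
    eps = {t0 + s: k[s] - k[s - 1] for s in range(1, len(k))}
    t1 = t0 + len(k) - 1
    rows = []
    for t in range(t0 + tau + 1, t1 + 1):
        features = [eps[s] for s in range(t - tau, t)]  # eps_{t-tau}, ..., eps_{t-1}
        rows.append((t, features, eps[t]))              # response Y_t = eps_t
    return rows

# Synthetic values for the 54 calendar years 1950, ..., 2003 (not real data).
k_hat = [-0.5 * s * s for s in range(54)]
rows = lookback_rows(k_hat, t0=1950, tau=3)
```

With 54 yearly values and $\tau = 3$ this yields $2003 - 1950 - \tau = 50$ rows per population, the first one belonging to the calendar year 1954.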


Listing 8.1 LSTM architecture example

TS      = layer_input(shape=c(lookback,1), dtype='float32', name='TS')
Country = layer_input(shape=c(1), dtype='int32', name='Country')
Gender  = layer_input(shape=c(1), dtype='int32', name='Gender')
Time    = layer_input(shape=c(1), dtype='float32', name='Time')
# two-dimensional embedding of the 8 country codes
CountryEmb = Country %>%
   layer_embedding(input_dim=8,output_dim=2,input_length=1,name='CountryEmb') %>%
   layer_flatten(name='Country_flat')
# two-dimensional embedding of the 2 gender labels
GenderEmb = Gender %>%
   layer_embedding(input_dim=2,output_dim=2,input_length=1,name='GenderEmb') %>%
   layer_flatten(name='Gender_flat')
# LSTM cell with 15 neurons processing the time-series input
LSTM = TS %>%
   layer_lstm(units=15,activation='tanh',recurrent_activation='sigmoid',
              name='LSTM')
# concatenation, shallow FN layer with 10 neurons, and linear output
Output = list(LSTM,CountryEmb,GenderEmb,Time) %>% layer_concatenate() %>%
   layer_dense(units=10, activation='tanh', name='FNLayer') %>%
   layer_dense(units=1, activation='linear', name='Network')
# full model with the four inputs
model = keras_model(inputs = list(TS, Country, Gender, Time),
                    outputs = c(Output))


In Fig. 8.9 we plot the LSTM architecture used to forecast $\varepsilon_t^{(p)}$ for $t > t_1$, and
Listing 8.1 gives the corresponding R code. We process the time-series $x_{t-\tau:t-1}$
through a LSTM cell, see lines 14-16 of Listing 8.1. We choose a shallow LSTM
network ($d = 1$) and therefore drop the upper index $m = 1$ in (8.15), but we add
an upper index $[\mathrm{LSTM}]$ to highlight the output of the LSTM cell. This gives us the
LSTM cell updates for $t - \tau \le s \le t - 1$

$$\left(x_s, z_{s-1}^{[\mathrm{LSTM}]}, c_{s-1}\right) \mapsto \left(z_s^{[\mathrm{LSTM}]}, c_s\right) = z^{\mathrm{LSTM}}\left(x_s, z_{s-1}^{[\mathrm{LSTM}]}, c_{s-1}\right).$$

This LSTM recursion to process the time-series $x_{t-\tau:t-1}$ gives us the LSTM output
$z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{15}$, and it involves $4(q_0 + 1 + q_1) q_1 = 4(2 + 15) 15 = 1'020$ network
parameters for the input dimension $q_0 = 1$ and the output dimension $q_1 = 15$.

[Fig. 8.9: LSTM architecture used to forecast $\varepsilon_t^{(p)}$ for $t > t_1$; the outputs of the LSTM cell
(depth $d = 1$), the country embedding layer and the gender embedding layer are concatenated
into a shallow FN layer]

For the categorical country code $c$ and the binary gender $g$ we choose
two-dimensional embedding layers, see (7.31),

$$c \mapsto e^{C}(c) \in \mathbb{R}^2 \qquad \text{and} \qquad g \mapsto e^{G}(g) \in \mathbb{R}^2,$$

these embedding maps give us $2(8 + 2) = 20$ embedding weights. Finally, we
concatenate the LSTM output $z_{t-1}^{[\mathrm{LSTM}]} \in \mathbb{R}^{15}$, the embeddings $e^{C}(c), e^{G}(g) \in \mathbb{R}^2$
and the continuous calendar year variable $t \in \mathbb{R}$ and process this vector through a
shallow FN network with $q_2 = 10$ neurons, see lines 18-20 of Listing 8.1. This FN
layer gives us $(q_1 + 2 + 2 + 1 + 1) q_2 = (15 + 2 + 2 + 1 + 1) 10 = 210$ parameters.
Together with the output parameter of dimension $q_2 + 1 = 11$, we receive 1'261
network parameters to be fitted, which seems quite a lot.
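
The parameter count can be verified with a few lines of arithmetic. The following Python snippet only restates the dimensions given in the text; the variable names are chosen for illustration.

```python
# Dimensions taken from the text: q0 = 1 input, q1 = 15 LSTM neurons, q2 = 10 FN neurons.
q0, q1, q2 = 1, 15, 10
n_countries, n_genders, emb_dim = 8, 2, 2

lstm_params = 4 * (q0 + 1 + q1) * q1                 # four gates/layers of the LSTM cell
embedding_params = emb_dim * (n_countries + n_genders)
# FN layer input: LSTM output, two embeddings, calendar year, intercept
fn_params = (q1 + emb_dim + emb_dim + 1 + 1) * q2
output_params = q2 + 1                               # output weights plus intercept

total_params = lstm_params + embedding_params + fn_params + output_params
print(lstm_params, embedding_params, fn_params, output_params, total_params)
# prints: 1020 20 210 11 1261
```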

To fit this model we have $8 \cdot 2 = 16$ populations, and for each population we
have the observations $\hat{k}_t^{(p)}$ for the calendar years $1950 \le t \le 2003$. Considering
the increments $\varepsilon_t^{(p)}$ and a lookback period of $\tau = 3$ calendar years gives us
$2003 - 1950 - \tau = 50$ observations, rows in (8.19), per population $p$, thus, we have in total
800 observations. For the gradient descent fitting and the early stopping we choose a
training to validation split of 8 : 2. As loss function we choose the squared error loss
function. This implies that we assume that the increments $Y_t = \varepsilon_t^{(p)}$ are
Gaussian distributed, or in other words, minimizing the squared error loss function
means maximizing the Gaussian log-likelihood function. We then fit this model to
the data using early stopping as described in (7.27). We analyze this fitted model.
Figure 8.10 provides the learned embeddings for the country codes $c$. These learned
embeddings have some similarity with the European map.

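The final step is the extrapolation $\hat{k}_t$, $t > t_1$. These updates need to be done recursively. We initialize for $t = t_1 + 1$ the time-series feature

$$x_{t_1+1-\tau:t_1} = \left(\varepsilon^{(p)}_{t_1+1-\tau}, \ldots, \varepsilon^{(p)}_{t_1}\right)^\top \in \mathbb{R}^{\tau}. \tag{8.20}$$

Using the feature information $(t_1 + 1, c, g, x_{t_1+1-\tau:t_1})$ allows us to forecast the next increment $Y_{t_1+1} = \varepsilon^{(p)}_{t_1+1}$ by $\hat{Y}_{t_1+1}$, using the fitted LSTM architecture of Fig. 8.9. Thus, this LSTM network allows us to perform a one-period-ahead forecast to receive

$$\hat{k}_{t_1+1} = \hat{k}_{t_1} + \hat{Y}_{t_1+1}. \tag{8.21}$$

This update (8.21) needs to be iterated recursively. For the next period $t = t_1 + 2$ we set for the time-series feature

$$x_{t_1+2-\tau:t_1+1} = \left(\varepsilon^{(p)}_{t_1+2-\tau}, \ldots, \varepsilon^{(p)}_{t_1}, \hat{Y}_{t_1+1}\right)^\top \in \mathbb{R}^{\tau}, \tag{8.22}$$

which gives us the next predictions $\hat{Y}_{t_1+2}$ and $\hat{k}_{t_1+2}$, etc.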
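The recursion (8.20)-(8.22) can be sketched in a few lines of base R. The function `recursive_forecast` and its argument `predict_increment` are hypothetical names introduced here; `predict_increment` merely stands in for the fitted LSTM network (which would also receive the calendar year, country and gender), and the toy predictor below is a placeholder, not the model of this section.

```r
# Recursive one-period-ahead extrapolation following (8.20)-(8.22).
# predict_increment: hypothetical stand-in for the fitted network, mapping the
# lookback window of increments to the forecast of the next increment.
recursive_forecast <- function(eps, k_last, horizon, tau, predict_increment) {
  x <- tail(eps, tau)                  # initial feature (8.20): last tau observed increments
  y_hat <- numeric(horizon)
  k_hat <- numeric(horizon)
  k_prev <- k_last
  for (h in seq_len(horizon)) {
    y_hat[h] <- predict_increment(x)   # forecast of the next increment
    k_hat[h] <- k_prev + y_hat[h]      # update (8.21)
    k_prev <- k_hat[h]
    x <- c(x[-1], y_hat[h])            # feature update (8.22): drop oldest, append forecast
  }
  list(increments = y_hat, k = k_hat)
}

# Toy usage with made-up increments and a placeholder predictor (window mean)
eps <- c(-1.2, -0.4, -1.5, -0.3, -1.1)
fc <- recursive_forecast(eps, k_last = -20, horizon = 3, tau = 3,
                         predict_increment = function(x) mean(x))
print(fc)
```

Note that from the second step onwards the window contains forecasts in place of observations, which is the feature replacement made in (8.22).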

In Fig. 8.11 we present the extrapolation of $\varepsilon_t^{(p)}$, $t > t_1$, for Belgium females and males. The blue curve shows the observed increments $(\varepsilon_t^{(p)})_{1951 \le t \le 2003}$ and the LSTM fitted (in-sample) values $(\hat{Y}_t)_{1954 \le t \le 2003}$ are in red color. Firstly, we observe a negative correlation (zig-zag behavior) in both the blue observations $(\varepsilon_t^{(p)})_{1951 \le t \le 2003}$ and in their red estimated means $(\hat{Y}_t)_{1954 \le t \le 2003}$. Thus, the LSTM finds this negative correlation (and it does not propose i.i.d. residuals). Secondly, the volatility in the red curve is smaller than in the blue curve, the former relates to expected values and the latter to observations of the random variables (which should be more volatile). The light-blue color shows the random walk with drift extrapolation (which is just a horizontal straight line at level $\hat{\delta}^{\rm MLE}$, see (8.18)). The orange color shows the LSTM extrapolation using the recursive one-period-ahead updates (8.20)-(8.22), which has a zig-zag behavior that vanishes over time. This vanishing behavior is critical and is going to be discussed next.

Fig. 8.11 LSTM network extrapolation $(\hat{Y}_t)_{t > t_1}$ for Belgium (BE) females and males

There is one issue with this recursive one-period-ahead updating algorithm. This updating algorithm is not fully consistent in how the data is being used. The original LSTM architecture calibration is based on the feature components $\varepsilon_t^{(p)}$, see (8.20). Since these increments are not known for the later periods $t > t_1$, we replace their unknown values by the predictors, see (8.22). The subtle point here is that the predictors are on the level of expected values, and not on the level of random variables. Thus, $\hat{Y}_t$ is typically less volatile than $\varepsilon_t^{(p)}$, but in (8.22) we pretend that we can use these predictors as a one-to-one replacement. A more consistent way would be to simulate/bootstrap $\varepsilon_t^{(p)}$ from $\mathcal{N}(\hat{Y}_t, \hat{\sigma}^2)$ so that the extrapolation receives the same volatility as the original process. For simplicity we refrain from doing so, but Fig. 8.11 indicates that this would be a necessary step because the volatility in the orange curve is going to vanish after the calendar year 2003, i.e., the zig-zag behavior vanishes, which is clearly not appropriate.

The LSTM extrapolation of $\hat{k}_t^{(p)}$ is shown in Fig. 8.12. We observe quite some similarity to the random walk with drift extrapolation in Fig. 8.8, and, indeed, the random walk with drift seems to work very well (though the auto-correlation has not been specified correctly). Note that Fig. 8.8 is based on the individual extrapolations in $p$, whereas in Fig. 8.12 we have a common model for all populations.

Fig. 8.12 LSTM network extrapolation of $\hat{k}_t^{(p)}$ for different countries and genders

Table 8.1 shows how often one model outperforms the other one (out-of-sample on calendar years $2004 \le t \le 2018$ and per gender). On the male populations of the 8 European countries each model outperforms the other one 4 times, whereas for the female population the random walk with drift gives 5 times the better out-of-sample prediction. Of course, this seems disappointing for the LSTM approach. A common observation is that deep learning approaches outperform the classical methods on complex problems. However, on simple problems, as the one here, we should go for a classical (simpler) model like a random walk with drift or an ARIMA model.

Table 8.1 Comparison of the out-of-sample mean squared error losses for the calendar years $2004 \le t \le 2018$: the numbers show how often one approach outperforms the other one on each gender

                          Female   Male
Random walk with drift    5/8      4/8
LSTM architecture         3/8      4/8


8.4.2 Direct LSTM Mortality Forecasting 


The previous section has been relying on the LC mortality model and only the extrapolation of the time-series $(\hat{k}_t)_t$ has been based on a RN network architecture. In this section we aim at directly processing the raw mortality rates $M_{x,t} = D_{x,t}/e_{x,t}$ through a network, thus, we perform the representation learning directly on the raw data. We therefore use a simplified version of the network architecture proposed in Perla et al. [301].

As input to the network we use the raw mortality rates $M_{x,t}$. We choose a lookback period of $\tau = 5$ years and we define the time-series feature information to forecast the mortality in calendar year $t$ by

$$x_{t-\tau:t-1} = (x_{t-\tau}, \ldots, x_{t-1}) = (M_{x,s})_{0 \le x \le 99,\; t-\tau \le s \le t-1} \in \mathbb{R}^{(99-0+1) \times \tau} = \mathbb{R}^{100 \times 5}. \tag{8.23}$$

Thus, we directly process the raw mortality rates (simultaneously for all ages $x$) through the network architecture; in the corresponding R code we need to input the transposed features $x_{t-\tau:t-1}^\top \in \mathbb{R}^{5 \times 100}$, see line 1 of Listing 8.2.

We choose a shallow LSTM network ($d = 1$) and drop the upper index $m = 1$ in (8.15). This gives us the LSTM cell updates for $t - \tau \le s \le t - 1$

$$\left(x_s, z^{[{\rm LSTM}]}_{s-1}, c_{s-1}\right) \mapsto \left(z^{[{\rm LSTM}]}_{s}, c_{s}\right) = z^{\rm LSTM}\left(x_s, z^{[{\rm LSTM}]}_{s-1}, c_{s-1}\right).$$

This LSTM recursion to process the time-series $x_{t-\tau:t-1}$ gives us the LSTM output $z^{[{\rm LSTM}]}_{t-1} \in \mathbb{R}^{q_1}$, see lines 14-15 of Listing 8.2. It involves $4(q_0 + 1 + q_1)q_1 = 4(100 + 1 + 20)20 = 9'680$ network parameters for the input dimension $q_0 = 100$ and the output dimension $q_1 = 20$. Many statisticians would probably stop at this point with this approach, as it seems highly over-parametrized. Let's see what we get.

Fig. 8.13 LSTM architecture used to process the raw mortality rates $(M_{x,t})_{x,t}$

For the categorical country code $c$ and the binary gender $g$ we choose two one-dimensional embeddings, see (7.31),

$$c \mapsto e^{C}(c) \in \mathbb{R} \qquad \text{and} \qquad g \mapsto e^{G}(g) \in \mathbb{R}. \tag{8.24}$$

These embeddings give us $8 + 2 = 10$ embedding weights. Figure 8.13 shows the LSTM cell in orange color and the embeddings in yellow color (in the colored version).

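The LSTM output and the two embeddings are then concatenated to a learned representation $z_{t-1} = (z^{[{\rm LSTM}]}_{t-1}, e^{C}(c), e^{G}(g))^\top \in \mathbb{R}^{q_1+1+1} = \mathbb{R}^{22}$. The 22-dimensional learned representation $z_{t-1}$ encodes the 500-dimensional input $x_{t-\tau:t-1} \in \mathbb{R}^{100 \times 5}$ and the two categorical variables $c$ and $g$. The last step is to decode this representation $z_{t-1} \in \mathbb{R}^{22}$ to predict the log-mortality rates $(Y_{x,t})_{0 \le x \le 99} = (\log M_{x,t})_{0 \le x \le 99} \in \mathbb{R}^{100}$, simultaneously for all ages $x$. This decoding is obtained by the code on lines 17-19 of Listing 8.2; this reads as

$$z_{t-1} \mapsto \left(\hat{Y}_{x,t}\right)_{0 \le x \le 99} = \left(\beta_x^{0} + \beta_x^{C} e^{C}(c) + \beta_x^{G} e^{G}(g) + \left\langle \beta_x, z^{[{\rm LSTM}]}_{t-1} \right\rangle\right)_{0 \le x \le 99}. \tag{8.25}$$

This decoding involves another $(1 + 22)100 = 2'300$ parameters $(\beta_x^{0}, \beta_x^{C}, \beta_x^{G}, \beta_x)_{0 \le x \le 99}$. Thus, altogether this LSTM network has $r = 11'990$ parameters.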
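To see that a dense layer with linear activation on the concatenated representation has exactly the form (8.25), one can compare the two computations numerically. The following base R sketch uses random placeholder values for the LSTM output, the embeddings and the weights (none of them come from a fitted model); the variable names are chosen here for illustration.

```r
set.seed(1)
q1 <- 20
z_lstm <- rnorm(q1)      # placeholder for the LSTM output
e_c <- rnorm(1)          # placeholder country embedding
e_g <- rnorm(1)          # placeholder gender embedding

# Dense layer with linear activation: 100 x 22 weight matrix plus 100 intercepts
W <- matrix(rnorm(100 * (q1 + 2)), nrow = 100)
b <- rnorm(100)
z <- c(z_lstm, e_c, e_g) # concatenation in the order LSTM, country, gender
y_dense <- as.vector(W %*% z + b)

# The same computation written in the form (8.25)
beta_0 <- b
beta_lstm <- W[, 1:q1]
beta_c <- W[, q1 + 1]
beta_g <- W[, q1 + 2]
y_decoder <- beta_0 + beta_c * e_c + beta_g * e_g + as.vector(beta_lstm %*% z_lstm)

n_decoder_params <- length(W) + length(b)
print(all.equal(y_dense, y_decoder))
print(n_decoder_params)
```

The weight matrix and the intercepts together account for the $(1 + 22)100$ decoder parameters counted above.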
Summarizing: the above architecture follows the philosophy of the auto-encoder of Sect. 7.5. A high-dimensional observation $(x_{t-\tau:t-1}, c, g)$ is encoded to a low-dimensional bottleneck activation $z_{t-1} \in \mathbb{R}^{22}$, which is then decoded by (8.25) to give the forecast $(\hat{Y}_{x,t})_{0 \le x \le 99}$ for the log-mortality rates. It is not precisely an auto-encoder because the response is different from the input, as we forecast the log-mortality rates in the next calendar year $t$ based on the information $z_{t-1}$ that is available at the end of the previous calendar year $t - 1$. In contrast to the LC mortality model, we no longer rely on the two-step approach by first fitting the parameters with a SVD, and performing a random walk with drift extrapolation. This encoder-decoder network performs both steps simultaneously.

Listing 8.2 LSTM architecture to directly process the raw mortality rates $(M_{x,t})_{x,t}$

TS      = layer_input(shape=c(lookback,100), dtype='float32', name='TS')
Country = layer_input(shape=c(1), dtype='int32', name='Country')
Gender  = layer_input(shape=c(1), dtype='int32', name='Gender')
Time    = layer_input(shape=c(1), dtype='float32', name='Time')
# one-dimensional embedding of the 8 country codes
CountryEmb = Country %>%
  layer_embedding(input_dim=8, output_dim=1, input_length=1, name='CountryEmb') %>%
  layer_flatten(name='Country_flat')
# one-dimensional embedding of the 2 genders
GenderEmb = Gender %>%
  layer_embedding(input_dim=2, output_dim=1, input_length=1, name='GenderEmb') %>%
  layer_flatten(name='Gender_flat')
# shallow LSTM layer with 20 units processing the time-series input
LSTM = TS %>%
  layer_lstm(units=20, activation='linear', recurrent_activation='sigmoid',
             name='LSTM')
# concatenate LSTM output and embeddings, and decode to the 100 ages, see (8.25)
Output = list(LSTM, CountryEmb, GenderEmb) %>% layer_concatenate() %>%
  layer_dense(units=100, activation='linear', name='scalarproduct') %>%
  layer_reshape(c(1,100), name='Output')
# assemble the model (the input Time is defined above but not used here)
model = keras_model(inputs = list(TS, Country, Gender),
                    outputs = c(Output))

We fit this network architecture to the available data. We have $r = 11'990$ network parameters. Based on a lookback period of $\tau = 5$ years, we have $2003 - 1950 - \tau + 1 = 49$ observations per population $p = (c, g)$. Thus, we have in total 784 observations $(x_{t-\tau:t-1}, c, g, (Y_{x,t})_{0 \le x \le 99})$. We fit this network using the nadam version of the gradient descent algorithm. We choose a training to validation split of 8 : 2 and we explore 10'000 gradient descent epochs. A crucial observation is that the algorithm converges rather slowly and it does not show any signs of over-fitting, i.e., there is no strong need for the early stopping. This seems surprising because we have 11'990 parameters and only 784 observations. There are a couple of important ingredients that make this work. The features and observations themselves are high-dimensional, and the low-dimensional encoding (compression) leads to a natural regularization. Moreover, this is combined with linear activation functions, see lines 15 and 19 of Listing 8.2. The gradient descent fitting has a certain inertia, and it seems that high-dimensional problems on comparably smooth high-dimensional data do not over-fit to individual components because the gradients are not very sensitive in the individual partial derivatives (in high dimensions). These high-dimensional approaches only work if we have sufficiently many populations across which we can learn. Here we have 16 populations; Perla et al. [301] even use 76 populations.
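The counts quoted in this paragraph can be reproduced as follows. This is a bookkeeping sketch with variable names chosen here; the last quantity, the number of scalar responses, is an additional illustration (not a figure from the text) of the remark that the observations themselves are high-dimensional.

```r
# Observations available for the direct LSTM approach
n_populations <- 8 * 2
tau <- 5
obs_per_population <- 2003 - 1950 - tau + 1
n_obs <- n_populations * obs_per_population

# Parameters of the architecture of Listing 8.2
q0 <- 100
q1 <- 20
lstm_params <- 4 * (q0 + 1 + q1) * q1
emb_params <- 8 + 2
decoder_params <- (1 + q1 + 2) * 100
r <- lstm_params + emb_params + decoder_params

# Each observation carries a 100-dimensional response (one value per age)
n_scalar_responses <- n_obs * 100

print(c(obs_per_population = obs_per_population, n_obs = n_obs, r = r,
        n_scalar_responses = n_scalar_responses))
```

Counted per age, the 784 observations correspond to 78'400 scalar responses, which puts the 11'990 parameters into a different perspective.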

Since every gradient descent fit still involves several elements of randomness, we consider the nagging predictor (7.44), averaging over 10 fitted networks, see Sect. 7.4.4. The out-of-sample prediction results on the calendar years 2004 to 2018, i.e., $t > t_1 = 2003$, are presented in Table 8.2. These results verify the appropriateness of this LSTM approach. It outperforms the LC model on the female population in 6 out of 8 cases and on the male population in 7 out of 8 cases; only for the French population does this LSTM approach seem to have some difficulties (compared to the LC model). Note that these are out-of-sample figures because the LSTM has only been fitted on the data prior to 2004. Moreover, we did not pre-process the raw mortality rates $M_{x,t}$, $t \le 2003$, and the prediction is done recursively in a one-period-ahead prediction approach, we also refer to (8.22). A more detailed analysis of the results shows that the LC and the LSTM approaches have a rather similar behavior for females. For males the LSTM prediction clearly outperforms the LC model prediction, and this out-performance is across different ages $x$ and different calendar years $t \ge 2004$.

Table 8.2 Comparison of the out-of-sample mean squared losses for the calendar years $2004 \le t \le 2018$; the figures are in $10^{-4}$

                      LC female   LSTM female   LC male   LSTM male
Austria AUT           0.765       0.312         2.527     1.169
Belgium BE            0.371       0.311         2.835     0.960
Switzerland CH        0.654       0.478         1.609     1.134
Spain ESP             1.446       0.514         1.742     0.245
France FRA            0.175       1.684         0.333     0.363
Italy ITA             0.179       0.330         0.874     0.320
The Netherlands NL    0.426       0.315         1.978     0.601
Portugal POR          2.097       0.464         1.848     1.239
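The win counts quoted above can be checked directly from the figures of Table 8.2. The data frame below simply re-types the table (in units of $10^{-4}$); the column names are chosen here for illustration.

```r
# Out-of-sample mean squared losses of Table 8.2 (in 10^-4)
mse <- data.frame(
  country     = c("AUT", "BE", "CH", "ESP", "FRA", "ITA", "NL", "POR"),
  lc_female   = c(0.765, 0.371, 0.654, 1.446, 0.175, 0.179, 0.426, 2.097),
  lstm_female = c(0.312, 0.311, 0.478, 0.514, 1.684, 0.330, 0.315, 0.464),
  lc_male     = c(2.527, 2.835, 1.609, 1.742, 0.333, 0.874, 1.978, 1.848),
  lstm_male   = c(1.169, 0.960, 1.134, 0.245, 0.363, 0.320, 0.601, 1.239)
)

# Number of countries in which the LSTM has the smaller loss
lstm_wins_female <- sum(mse$lstm_female < mse$lc_female)
lstm_wins_male <- sum(mse$lstm_male < mse$lc_male)

# Countries in which the LC model is better
lc_better_female <- mse$country[mse$lc_female < mse$lstm_female]
lc_better_male <- mse$country[mse$lc_male < mse$lstm_male]

print(c(female = lstm_wins_female, male = lstm_wins_male))
print(list(female = lc_better_female, male = lc_better_male))
```

France is the only country in which the LC model has the smaller loss for both genders.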

The advantage of this LSTM approach is that we can directly predict by 
processing the raw data. The disadvantage compared to the LC approach is that the 
LSTM network approach is more complex and more time-consuming. Moreover, 
unlike in the LC approach, we cannot (easily) assess the prediction uncertainty. 
In the LC approach the prediction uncertainty is obtained from assessing the 
uncertainty in the extrapolation and the uncertainty in the parameter estimates, e.g., 
using a bootstrap. The LSTM approach is not sufficiently robust (at least not on our 
data) to provide any reasonable uncertainty estimates. 

We close this section and example by analyzing the functional form of the decoder (8.25). We observe that this decoder has much similarity with the LC model assumption (7.63)

$$\hat{Y}_{x,t} = \beta_x^{0} + \beta_x^{C} e^{C}(c) + \beta_x^{G} e^{G}(g) + \left\langle \beta_x, z^{[{\rm LSTM}]}_{t-1} \right\rangle,$$

$$\log\left(\mu^{(p)}_{x,t}\right) = a_x^{(p)} + b_x^{(p)} k_t^{(p)}.$$

The LC model considers the average force of mortality $a_x^{(p)} \in \mathbb{R}$ for each population $p = (c, g)$ and each age $x$; the LSTM architecture has the same term $\beta_x^{0} + \beta_x^{C} e^{C}(c) + \beta_x^{G} e^{G}(g)$. In the LC model, the change of force of mortality is considered by a population-dependent term $b_x^{(p)} k_t^{(p)}$, whereas the LSTM architecture has a term $\langle \beta_x, z^{[{\rm LSTM}]}_{t-1} \rangle$. This latter term is also population-dependent because the LSTM cell directly processes the raw mortality data $M_{x,t}$ coming from the different populations $p$. Note that this is the only time-$t$-dependent term in the LSTM architecture. We conclude that the main difference between these two forecast approaches is how the past mortality observations are processed. Apart from that the general structure is the same.
Chapter 9
Convolutional Neural Networks


The previous two chapters have been considering fully-connected feed-forward neural (FN) networks and recurrent neural (RN) networks. Fully-connected FN networks are the prototype of networks for deep representation learning on tabular data. This type of network extracts global properties from the features $x$. RN networks are an adaptation of FN networks to time-series data. Convolutional neural (CN) networks are a third type of network, and their specialty is to extract local structure from the features. Originally, they have been introduced for speech and image recognition aiming at finding similar structure in different parts of the feature $x$. For instance, if $x$ is a picture consisting of pixels, and if we want to classify this picture according to its contents, then we try to find similar structure (objects) in different locations of this picture. CN networks are suitable for this task as they work with filters (kernels) that have a fixed window size. These filters then screen across the picture to detect similar local structure at different locations in the picture. CN networks were introduced in the 1980s by Fukushima [145] and LeCun et al. [234, 235], and they have been celebrating great success in many applications. Our introduction to CN networks is based on the tutorial of Meier-Wüthrich [269]. For real data applications there are many pre-trained CN network libraries that can be downloaded and used for several different tasks; an example for image recognition is the AlexNet of Krizhevsky et al. [226].


9.1 Plain-Vanilla Convolutional Neural Network Layer 


Structurally, the CN network architectures are similar to the FN network architectures, only they replace certain FN layers by CN layers. Therefore, we start by introducing the CN layer, and one should keep the structure of the FN layer (7.5) in mind. In a nutshell, FN layers consider non-linearly activated inner products $\langle w_j^{(m)}, z \rangle$, and CN layers replace these inner products by a type of convolution $W_j^{(m)} * z$.


9.1.1 Input Tensors and Channels 


We start from an input tensor $z \in \mathbb{R}^{q^{(1)} \times \cdots \times q^{(K)}}$ that has dimension $q^{(1)} \times \cdots \times q^{(K)}$. This input tensor $z$ is a multi-dimensional array of order (length) $K \in \mathbb{N}$ and with elements $z_{i_1, \ldots, i_K} \in \mathbb{R}$ for $1 \le i_k \le q^{(k)}$ and $1 \le k \le K$. The special case of order $K = 2$ is a matrix $z \in \mathbb{R}^{q^{(1)} \times q^{(2)}}$. This matrix can illustrate a black and white image of dimension $q^{(1)} \times q^{(2)}$ with the matrix entries $z_{i_1, i_2} \in \mathbb{R}$ describing the intensities of the gray scale in the corresponding pixels $(i_1, i_2)$. A color image typically has the three color channels Red, Green and Blue (RGB), and such a RGB image can be represented by a tensor $z \in \mathbb{R}^{q^{(1)} \times q^{(2)} \times q^{(3)}}$ of order 3 with $q^{(1)} \times q^{(2)}$ being the dimension of the image and $q^{(3)} = 3$ describing the three color channels, i.e., $(z_{i_1,i_2,1}, z_{i_1,i_2,2}, z_{i_1,i_2,3})^\top \in \mathbb{R}^3$ describes the intensities of the colors RGB in the pixel $(i_1, i_2)$.

Typically, the structure of black and white images and RGB images is unified by representing the black and white picture by a tensor $z \in \mathbb{R}^{q^{(1)} \times q^{(2)} \times q^{(3)}}$ of order 3 with a single channel $q^{(3)} = 1$. This philosophy is going to be used throughout this chapter. Namely, if we consider a tensor $z \in \mathbb{R}^{q^{(1)} \times \cdots \times q^{(K-1)} \times q^{(K)}}$ of order $K$, the first $K - 1$ components $(i_1, \ldots, i_{K-1})$ will play the role of the spatial components that have a natural topology, and the last components $1 \le i_K \le q^{(K)}$ are called the channels reflecting, e.g., a gray scale (for $q^{(K)} = 1$) or the RGB intensities (for $q^{(K)} = 3$).

In Sect. 9.1.3, below, we will also study time-series data where we have 2nd order tensors (matrices). The first component reflects time $1 \le t \le q^{(1)}$, i.e., the spatial component is temporal for time-series data, and the second component (channels) describes the different elements $z_t = (z_{t,1}, \ldots, z_{t,q^{(2)}})^\top \in \mathbb{R}^{q^{(2)}}$ that are measured/observed at each time point $t$.


9.1.2 Generic Convolutional Neural Network Layer 


We start from an input tensor $z \in \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}}$ of order $K$. The first $K - 1$ components of this tensor have a spatial structure and the $K$-th component stands for the channels. A CN layer applies (local) convolution operations to this tensor. We choose a filter size, also called window size or kernel size, $(f_m^{(1)}, \ldots, f_m^{(K)})^\top \in \mathbb{N}^K$ with $f_m^{(k)} \le q_{m-1}^{(k)}$ for $1 \le k \le K - 1$, and $f_m^{(K)} = q_{m-1}^{(K)}$. This filter size determines the output dimension of the CN operation by

$$q_m^{(k)} \overset{\rm def.}{=} q_{m-1}^{(k)} - f_m^{(k)} + 1, \tag{9.1}$$

for $1 \le k \le K$. Thus, the size of the image is reduced by the window size of the filter. In particular, the output dimension of the channels component $k = K$ is $q_m^{(K)} = 1$, i.e., all channels are compressed to a scalar output. The spatial components $1 \le k \le K - 1$ retain their spatial structure but the dimension is reduced according to (9.1).

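A CN operation is a mapping (note that the order of the tensor is reduced from $K$ to $K - 1$ because the channels are compressed; index $j$ is going to be explained later)

$$z_j^{(m)} : \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}} \to \mathbb{R}^{q_m^{(1)} \times \cdots \times q_m^{(K-1)}}, \qquad z \mapsto z_j^{(m)}(z) = \left(z^{(m)}_{i_1, \ldots, i_{K-1}; j}(z)\right)_{1 \le i_k \le q_m^{(k)};\, 1 \le k \le K-1}, \tag{9.2}$$

taking the values for a fixed activation function $\phi : \mathbb{R} \to \mathbb{R}$

$$z^{(m)}_{i_1, \ldots, i_{K-1}; j}(z) = \phi\left(w^{(m)}_{0,j} + \sum_{l_1=1}^{f_m^{(1)}} \cdots \sum_{l_K=1}^{f_m^{(K)}} w^{(m)}_{l_1, \ldots, l_K; j}\, z_{i_1+l_1-1, \ldots, i_{K-1}+l_{K-1}-1, l_K}\right), \tag{9.3}$$

for given intercept $w^{(m)}_{0,j} \in \mathbb{R}$ and filter weights

$$W_j^{(m)} = \left(w^{(m)}_{l_1, \ldots, l_K; j}\right)_{1 \le l_k \le f_m^{(k)};\, 1 \le k \le K} \in \mathbb{R}^{f_m^{(1)} \times \cdots \times f_m^{(K)}}; \tag{9.4}$$

the network parameter has dimension $r_m = 1 + \prod_{k=1}^{K} f_m^{(k)}$.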


At first sight this CN operation looks quite complicated. Let us give some remarks that allow for a better understanding and a more compact notation. The operation in (9.3) chooses the corner $(i_1, \ldots, i_{K-1}, 1)$ as base point, and then it reads the tensor elements in the (discrete) window

$$(i_1, \ldots, i_{K-1}, 1) + \left[0 : f_m^{(1)} - 1\right] \times \cdots \times \left[0 : f_m^{(K-1)} - 1\right] \times \left[0 : f_m^{(K)} - 1\right], \tag{9.5}$$

with given filter weights $W_j^{(m)}$. This window is then moved across the entire tensor $z$ by changing the base point $(i_1, \ldots, i_{K-1}, 1)$ accordingly, but with fixed filter weights $W_j^{(m)}$. This operation resembles a convolution, however, in (9.3) the indices in $z_{i_1+l_1-1, \ldots, i_{K-1}+l_{K-1}-1, l_K}$ run in reverse direction compared to a classical (mathematical) convolution. By a slight abuse of notation, nevertheless, we use the symbol of the convolution operator $*$ to abbreviate (9.2). This gives us the compact notation:

$$z_j^{(m)} : \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}} \to \mathbb{R}^{q_m^{(1)} \times \cdots \times q_m^{(K-1)}}, \qquad z \mapsto z_j^{(m)}(z) = \phi\left(w^{(m)}_{0,j} + W_j^{(m)} * z\right), \tag{9.6}$$

having the activations for $1 \le i_k \le q_m^{(k)}$, $1 \le k \le K - 1$,

$$\phi\left(w^{(m)}_{0,j} + W_j^{(m)} * z\right)_{i_1, \ldots, i_{K-1}} = z^{(m)}_{i_1, \ldots, i_{K-1}; j}(z),$$

where the latter is given by (9.3).
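For a tensor of order $K = 3$ (two spatial components and one channel component), the CN operation (9.3) can be written out with two loops over the base point. The following base R sketch is only meant to make the index bookkeeping concrete; the function name `cn_operation`, the choice of `tanh` as activation and the random inputs are assumptions made here.

```r
# CN operation (9.3) for a tensor of order K = 3: two spatial components, one channel component.
# z: array q1 x q2 x q3; W: filter weights f1 x f2 x q3; w0: intercept; phi: activation.
cn_operation <- function(z, W, w0, phi = tanh) {
  q <- dim(z)
  f <- dim(W)
  stopifnot(length(q) == 3, length(f) == 3, f[3] == q[3], all(f[1:2] <= q[1:2]))
  out_dim <- q[1:2] - f[1:2] + 1                 # spatial output dimension (9.1)
  out <- matrix(0, out_dim[1], out_dim[2])
  for (i1 in seq_len(out_dim[1])) {
    for (i2 in seq_len(out_dim[2])) {
      # window (9.5) with base point (i1, i2, 1); all channels are read
      window <- z[i1:(i1 + f[1] - 1), i2:(i2 + f[2] - 1), , drop = FALSE]
      out[i1, i2] <- phi(w0 + sum(W * window))   # same filter weights for every base point
    }
  }
  out
}

set.seed(1)
z <- array(rnorm(6 * 5 * 3), dim = c(6, 5, 3))   # toy input tensor
W <- array(rnorm(2 * 3 * 3), dim = c(2, 3, 3))   # toy filter of size 2 x 3 x 3
a <- cn_operation(z, W, w0 = 0.1)
print(dim(a))
```

The output is a matrix of dimension $(6 - 2 + 1) \times (5 - 3 + 1)$: the channel component has been compressed, and the spatial components are reduced according to (9.1).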


Remarks 9.1 


• The beauty of this notation is that we can now see the analogy to the FN layer. Namely, (9.6) exactly plays the role of a FN neuron (7.6), but the CN operation $w^{(m)}_{0,j} + W_j^{(m)} * z$ replaces the inner product $\langle w_j^{(m)}, z \rangle$, and correspondingly accounting for the intercept.

• A FN neuron (7.6) can be seen as a special case of CN operation (9.6). Namely, if we have a tensor of order $K = 1$, the input tensor (vector) reads as $z \in \mathbb{R}^{q_{m-1}^{(1)}}$. That is, we do not have a spatial component, but only $q_{m-1} = q_{m-1}^{(1)}$ channels. In that case we have $W_j^{(m)} * z = \langle W_j^{(m)}, z \rangle$ for the filter weights $W_j^{(m)} \in \mathbb{R}^{q_{m-1}^{(1)}}$, and where we assume that $z$ does not include an intercept component. Thus, the CN operation boils down to a FN neuron in the case of a tensor of order 1.

• In the CN operation we take advantage of having a spatial structure in the tensor $z$, which is not the case in the FN operation. The CN operation takes a spatial input of dimension $\prod_{k=1}^{K} q_{m-1}^{(k)}$ and it maps this input to a spatial object of dimension $\prod_{k=1}^{K-1} q_m^{(k)}$. For this it uses $r_m = 1 + \prod_{k=1}^{K} f_m^{(k)}$ filter weights. The FN operation takes an input of dimension $q_{m-1}$ and it maps it to a 1-dimensional neuron activation, for this it uses $1 + q_{m-1}$ parameters. If we identify the input dimensions $q_{m-1} = \prod_{k=1}^{K} q_{m-1}^{(k)}$ we can observe that $r_m \ll 1 + q_{m-1}$ because, typically, the filter sizes $f_m^{(k)} \ll q_{m-1}^{(k)}$ for $1 \le k \le K - 1$. Thus, the CN operation uses much fewer parameters as the filters only act locally through the $*$-operation by translating the filter window (9.5).

This understanding now allows us to define a CN layer. Note that the mappings (9.6) have a lower index $j$ which indicates that this is one single projection (filter extraction), called a filter. By choosing multiple different filters $(w^{(m)}_{0,j}, W_j^{(m)})$, we can define the CN layer as follows.

Choose $q_m^{(K)} \in \mathbb{N}$ filters, each having a $r_m$-dimensional filter weight $(w^{(m)}_{0,j}, W_j^{(m)})$, $1 \le j \le q_m^{(K)}$. A CN layer is a mapping

$$z^{(m)} : \mathbb{R}^{q_{m-1}^{(1)} \times \cdots \times q_{m-1}^{(K)}} \to \mathbb{R}^{q_m^{(1)} \times \cdots \times q_m^{(K-1)} \times q_m^{(K)}}, \qquad z \mapsto z^{(m)}(z) = \left(z_1^{(m)}(z), \ldots, z_{q_m^{(K)}}^{(m)}(z)\right), \tag{9.7}$$

with filters $z_j^{(m)}(z) \in \mathbb{R}^{q_m^{(1)} \times \cdots \times q_m^{(K-1)}}$, $1 \le j \le q_m^{(K)}$, given by (9.6).

A CN layer (9.7) converts the $q_{m-1}^{(K)}$ input channels to $q_m^{(K)}$ output filters by preserving the spatial structure on the first $K - 1$ components of the input tensor $z$.
More mathematically, CN layers and networks have been studied, among others, 
by Zhang et al. [403, 404], Mallat [263] and Wiatowski—Bolcskei [382]. These 
authors prove that CN networks have certain translation invariance properties 
and deformation stability. This exactly explains why these networks allow one to 
recognize similar objects at different locations in the input tensor. Basically, by 
translating the filter windows (9.5) across the tensor, we try to extract the local 
structure from the tensor that provides similar signals in different locations of that 
tensor. Thinking of an image where we try to recognize, say, a dog, such a dog can 
be located at different sites in the image, and a filter (window) that moves across 
that image tries to locate the dogs in the image. 

A CN layer (9.7) defines one layer indexed by the upper index $^{(m)}$, and for deep
representation learning we now have to compose multiple of these CN layers, but we 
can also compose CN layers with FN layers or RN layers. Before doing so, we need 
to introduce some special purpose layers and tools that are useful for CN network 
modeling, this is done in Sect. 9.2, below. 


9.1.3 Example: Time-Series Analysis and Image Recognition 


Most CN network examples are based on time-series data or images. The former 
has a 1-dimensional temporal component, and the latter has a 2-dimensional spatial 
component. Thus, these two examples are giving us tensors of orders K = 2 and 
$K = 3$, respectively. We briefly discuss such examples as specific applications of tensors of a general order $K \ge 2$.
Time-Series Analysis with CN Networks


For a time-series analysis we often have observations $x_t \in \mathbb{R}^{q_0}$ for the time points $0 \le t \le T$. Bringing this time-series data into a tensor form gives us

$$x = x_{0:T} = (x_0, \ldots, x_T)^\top \in \mathbb{R}^{(T+1) \times q_0} = \mathbb{R}^{q_0^{(1)} \times q_0^{(2)}},$$

with $q_0^{(1)} = T + 1$ and $q_0^{(2)} = q_0$. We have met such examples in Chap. 8 on RN networks. Thus, for time-series data the input to a CN network is a tensor of order $K = 2$ with a temporal component having the dimension $T + 1$ and at each time point $t$ we have $q_0$ measurements (channels) $x_t \in \mathbb{R}^{q_0}$. A CN network tries to find similar structure at different time points in this time-series data $x_{0:T}$. For a first CN layer $m = 1$ we therefore choose $q_1 \in \mathbb{N}$ filters and consider the mapping

$$z^{(1)} : \mathbb{R}^{(T+1) \times q_0} \to \mathbb{R}^{(T - f_1 + 2) \times q_1}, \qquad x_{0:T} \mapsto z^{(1)}(x_{0:T}) = \left(z_1^{(1)}(x_{0:T}), \ldots, z_{q_1}^{(1)}(x_{0:T})\right), \tag{9.8}$$

with filters $z_j^{(1)}(x_{0:T}) \in \mathbb{R}^{T - f_1 + 2}$, $1 \le j \le q_1$, given by (9.6) and for a fixed window size $f_1 \in \mathbb{N}$. From (9.8) we observe that the length of the time-series is reduced from $T + 1$ to $T - f_1 + 2$ accounting for the window size $f_1$. In financial mathematics, a structure (9.8) is often called a rolling window that moves across the time-series $x_{0:T}$ and extracts the corresponding information.
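The rolling window interpretation of (9.8) can be made explicit in base R. In the following sketch the function name `cn_layer_1d`, the `tanh` activation and the simulated data are assumptions made here for illustration.

```r
# CN layer (9.8) for time-series data: a rolling window of size f1 with q1 filters.
# x: (T+1) x q0 matrix; W: f1 x q0 x q1 array of filter weights; w0: q1 intercepts.
cn_layer_1d <- function(x, W, w0, phi = tanh) {
  n <- nrow(x)
  f1 <- dim(W)[1]
  q1 <- dim(W)[3]
  stopifnot(ncol(x) == dim(W)[2], length(w0) == q1, f1 <= n)
  out <- matrix(0, n - f1 + 1, q1)
  for (t in seq_len(n - f1 + 1)) {
    window <- x[t:(t + f1 - 1), , drop = FALSE]   # rolling window starting at row t
    for (j in seq_len(q1)) {
      Wj <- matrix(W[, , j], nrow = f1)           # weights of filter j
      out[t, j] <- phi(w0[j] + sum(Wj * window))
    }
  }
  out
}

set.seed(1)
T_end <- 9                                        # time points 0, ..., 9
q0 <- 2
x <- matrix(rnorm((T_end + 1) * q0), ncol = q0)
W <- array(rnorm(3 * q0 * 4), dim = c(3, q0, 4))  # window size f1 = 3, q1 = 4 filters
z1 <- cn_layer_1d(x, W, w0 = rep(0, 4))
print(dim(z1))
```

With $T = 9$, $f_1 = 3$ and $q_1 = 4$ the output has dimension $(T - f_1 + 2) \times q_1 = 8 \times 4$.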

We have introduced two different architectures to process time-series information 
xo:7, and these different architectures serve different purposes. A RN network 
architecture is most suitable if we try to forecast the next response of a time- 
series. I.e., we typically process the past observations through a recurrent structure 
to predict the next response, this is the motivation, e.g., behind Figs. 8.4 and 8.5. 
The motivation for the use of a CN network architecture is different as we try to 
find similar structure at different times, e.g., in a financial time-series we may be 
interested in finding the downturns of more than 20%. The latter is a local analysis 
which is explored by local filters (of a finite window size). 


Image Recognition 


Image recognition extends (9.8) by one order to a tensor of order $K = 3$. Typically, we have images of dimensions (pixels) $I \times J$, and having three color channels RGB. These images then read as

$$x = (x_1, x_2, x_3) \in \mathbb{R}^{I \times J \times 3} = \mathbb{R}^{q_0^{(1)} \times q_0^{(2)} \times q_0^{(3)}},$$

where $x_1 \in \mathbb{R}^{I \times J}$ is the intensity of red, $x_2 \in \mathbb{R}^{I \times J}$ is the intensity of green, and $x_3 \in \mathbb{R}^{I \times J}$ is the intensity of blue.

Choose a window size of $f_1^{(1)} \times f_1^{(2)}$ and $q_1 \in \mathbb{N}$ filters to receive the CN layer

$$z^{(1)} : \mathbb{R}^{I \times J \times 3} \to \mathbb{R}^{(I - f_1^{(1)} + 1) \times (J - f_1^{(2)} + 1) \times q_1}, \tag{9.9}$$

$$(x_1, x_2, x_3) \mapsto z^{(1)}(x_1, x_2, x_3) = \left(z_1^{(1)}(x_1, x_2, x_3), \ldots, z_{q_1}^{(1)}(x_1, x_2, x_3)\right),$$

with filters $z_j^{(1)}(x_1, x_2, x_3) \in \mathbb{R}^{(I - f_1^{(1)} + 1) \times (J - f_1^{(2)} + 1)}$, $1 \le j \le q_1$. Thus, we compress the 3 channels in each filter $j$, but we preserve the spatial structure of the image (by the convolution operation $*$).

For black and white pictures which only have one color channel, we preserve the spatial structure of the picture, and we modify the input tensor to a tensor of order 3 and of the form

$$x = (x_1) \in \mathbb{R}^{I \times J \times 1}.$$

9.2 Special Purpose Tools for Convolutional Neural 
Networks 


9.2.1 Padding with Zeros 


We have seen that the CN operation reduces the size of the output by the filter sizes, see (9.1). Thus, if we start from an image of size $100 \times 50 \times 1$, and if the filter sizes are given by $f_m^{(1)} = f_m^{(2)} = 9$, then the output will be of dimension $92 \times 42 \times q_m^{(3)}$, see (9.9). Sometimes this reduction in dimension is impractical, and padding helps to keep the original shape. Padding a tensor $z$ with $p_m^{(k)}$ parameters, $1 \le k \le K - 1$, means that the tensor is extended in all $K - 1$ spatial directions by (typically) adding zeros of that size, so that the padded tensor has dimension

$$\left(p_m^{(1)} + q_{m-1}^{(1)} + p_m^{(1)}\right) \times \cdots \times \left(p_m^{(K-1)} + q_{m-1}^{(K-1)} + p_m^{(K-1)}\right) \times q_{m-1}^{(K)}.$$

This implies that the output filters will have the dimensions

$$q_m^{(k)} = q_{m-1}^{(k)} + 2 p_m^{(k)} - f_m^{(k)} + 1,$$

for $1 \le k \le K - 1$. The spatial dimension of the original tensor size is preserved if $2 p_m^{(k)} - f_m^{(k)} + 1 = 0$. Padding does not add any additional parameters, but it is only used to reshape the tensors.
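The effect of padding on the spatial output dimensions is easily tabulated. The helper `cn_output_dim` below is a hypothetical name introduced here; it only evaluates the formula $q_{m-1}^{(k)} + 2 p_m^{(k)} - f_m^{(k)} + 1$ component-wise.

```r
# Spatial output dimension of a CN operation with padding p (p = 0: no padding)
cn_output_dim <- function(q, f, p = 0) q + 2 * p - f + 1

no_padding <- cn_output_dim(q = c(100, 50), f = c(9, 9))
with_padding <- cn_output_dim(q = c(100, 50), f = c(9, 9), p = c(4, 4))

print(no_padding)     # example of the text: 100 x 50 image with 9 x 9 filter
print(with_padding)   # p = 4 solves 2p - f + 1 = 0 for f = 9
```

For odd filter sizes $f$ the choice $p = (f - 1)/2$ preserves the spatial dimension; for even $f$ the condition $2p - f + 1 = 0$ has no integer solution, so a symmetric padding cannot preserve the dimension exactly.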
9.2.2 Stride


Strides are used to skip part of the input tensor $z$ in order to reduce the size of the output. This may be useful if the input tensor is a very high resolution image. Choose the stride parameters $s_m^{(k)} \in \mathbb{N}$, $1 \le k \le K - 1$. We can then replace the summation in (9.3) by the following term

$$\sum_{l_1=1}^{f_m^{(1)}} \cdots \sum_{l_K=1}^{f_m^{(K)}} w^{(m)}_{l_1, \ldots, l_K; j}\; z_{s_m^{(1)}(i_1 - 1) + l_1, \ldots, s_m^{(K-1)}(i_{K-1} - 1) + l_{K-1}, l_K}.$$

This only extracts the tensor entries on a discrete grid of the tensor by translating the window by multiples of integers, see also (9.5),

$$\left(s_m^{(1)}(i_1 - 1), \ldots, s_m^{(K-1)}(i_{K-1} - 1), 1\right) + \left[1 : f_m^{(1)}\right] \times \cdots \times \left[1 : f_m^{(K-1)}\right] \times \left[0 : f_m^{(K)} - 1\right],$$

and the size of the output is reduced correspondingly. If we choose strides $s_m^{(k)} = f_m^{(k)}$, $1 \le k \le K - 1$, we receive a partition of the spatial part of the input tensor $z$, this is going to be used in the max-pooling layer (9.11).


9.2.3 Dilation 


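Dilation is similar to stride, though, different in that it enlarges the filter sizes instead of skipping certain positions in the input tensor. Choose the dilation parameters $e_m^{(k)} \in \mathbb{N}$, $1 \le k \le K - 1$. We can then replace the summation in (9.3) by the following term

$$\sum_{l_1=1}^{f_m^{(1)}} \cdots \sum_{l_K=1}^{f_m^{(K)}} w^{(m)}_{l_1, \ldots, l_K; j}\; z_{i_1 + e_m^{(1)}(l_1 - 1), \ldots, i_{K-1} + e_m^{(K-1)}(l_{K-1} - 1), l_K}.$$

This applies the filter weights to the tensor entries on discrete grids

$$(i_1, \ldots, i_{K-1}, 1) + e_m^{(1)}\left[0 : f_m^{(1)} - 1\right] \times \cdots \times e_m^{(K-1)}\left[0 : f_m^{(K-1)} - 1\right] \times \left[0 : f_m^{(K)} - 1\right],$$

where the intervals $e_m^{(k)}[0 : f_m^{(k)} - 1]$ run over the grids of span sizes $e_m^{(k)}$, $1 \le k \le K - 1$. Thus, in comparably smooth images we do not read all the pixels but only every $e_m^{(k)}$-th pixel in the window. Also this reduces the size of the output tensor.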
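The difference between stride and dilation is easiest to see on the index sets that a window reads in one spatial direction. The helper functions below are hypothetical names introduced here; the two output-size formulas are not stated in the text but follow from requiring that the last window still fits into an input of length $q$ (without padding).

```r
# Indices read in one spatial direction by a window of size f with base point i
stride_window <- function(i, f, s) s * (i - 1) + seq_len(f)      # stride s
dilation_window <- function(i, f, e) i + e * (seq_len(f) - 1)    # dilation e

# Number of base points for which the window fits into an input of length q
stride_output_dim <- function(q, f, s) floor((q - f) / s) + 1
dilation_output_dim <- function(q, f, e) q - e * (f - 1)

w_stride <- stride_window(i = 3, f = 4, s = 2)
w_dilation <- dilation_window(i = 3, f = 4, e = 2)
print(w_stride)      # consecutive entries, window shifted in steps of 2
print(w_dilation)    # every 2nd entry, starting at the base point

n_stride <- stride_output_dim(q = 20, f = 4, s = 2)
n_dilation <- dilation_output_dim(q = 20, f = 4, e = 2)
print(c(stride = n_stride, dilation = n_dilation))
```

For $s = 1$ and $e = 1$ both index sets reduce to the window of (9.3), and both output dimensions reduce to (9.1).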
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9.2.4 Pooling Layer 


As we have seen above, the dimension of the tensor is reduced by the filter 
size in each spatial direction if we do not apply padding with zeros. In general, 
deep representation learning follows the paradigm of auto-encoding by reducing a 
high-dimensional input to a low-dimensional representation. In CN networks this 
is usually (efficiently) done by so-called pooling layers. In spirit, pooling layers 
work similarly to CN layers (having a fixed window size), but we do not apply a 
convolution operation x, but rather a maximum operation to the window to extract 
the dominant tensor elements. 
We choose fixed window sizes $(f_1^{(m)}, \ldots, f_{K-1}^{(m)})^\top \in \mathbb{N}^{K-1}$ and strides
$s_k^{(m)} = f_k^{(m)}$, $1 \le k \le K-1$, for the spatial components of the tensor $z$ of order $K$. A
max-pooling layer is given by

$$z^{(m)}: \mathbb{R}^{q_1^{(m-1)} \times \cdots \times q_K^{(m-1)}} \to \mathbb{R}^{q_1^{(m)} \times \cdots \times q_K^{(m)}}, \qquad z \mapsto z^{(m)}(z) = \mathrm{MaxPool}(z), \qquad (9.10)$$

with dimensions $q_K^{(m)} = q_K^{(m-1)}$ and for $1 \le k \le K-1$

$$q_k^{(m)} = \left\lfloor q_k^{(m-1)} / f_k^{(m)} \right\rfloor, \qquad (9.11)$$

having the activations for $1 \le i_k \le q_k^{(m)}$, $1 \le k \le K$,

$$\mathrm{MaxPool}(z)_{i_1,\ldots,i_K} = \max_{1 \le l_k \le f_k^{(m)},\ 1 \le k \le K-1} z_{f_1^{(m)}(i_1-1)+l_1,\ \ldots,\ f_{K-1}^{(m)}(i_{K-1}-1)+l_{K-1},\ i_K}.$$

Alternatively, the floors in (9.11) could be replaced by ceilings and padding with
zeros to receive the right cardinality. This extracts the maximums from the (spatial)
windows

$$\left(f_1^{(m)}(i_1-1), \ldots, f_{K-1}^{(m)}(i_{K-1}-1), i_K\right) + [1:f_1^{(m)}] \times \cdots \times [1:f_{K-1}^{(m)}] \times [0]$$
$$= \left[f_1^{(m)}(i_1-1)+1 : f_1^{(m)} i_1\right] \times \cdots \times \left[f_{K-1}^{(m)}(i_{K-1}-1)+1 : f_{K-1}^{(m)} i_{K-1}\right] \times [i_K],$$

for each channel $1 \le i_K \le q_K^{(m-1)}$ individually. Thus, the max-pooling operator is
chosen such that it extracts the maximum of each channel and each window, the
windows providing a partition of the spatial part of the tensor. This reduces the
dimension of the tensor according to (9.11), e.g., if we consider a tensor of order 3
of an RGB image of dimension $I \times J = 180 \times 50$ and apply a max-pooling layer
with window sizes $f_1^{(m)} = 10$ and $f_2^{(m)} = 5$, we receive a dimension reduction

$$180 \times 50 \times 3 \ \mapsto\ 18 \times 10 \times 3.$$

Replacing the maximum operator in (9.10) by an averaging operator is sometimes
also used, and this is called an average-pooling layer.
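The window partition behind max-pooling can be made concrete with a small base R sketch. The helper `max_pool()` is hypothetical (it is not a keras function), it assumes a tensor of order 3 with the channels in the last component, and it uses the floor convention for the output dimensions.

```r
# Minimal sketch of max-pooling for a tensor of order 3 (height x width x channels).
# max_pool() is a hypothetical helper for illustration, not part of keras.
max_pool <- function(z, f1, f2) {
  q1 <- floor(dim(z)[1] / f1)   # output height, floor convention
  q2 <- floor(dim(z)[2] / f2)   # output width, floor convention
  out <- array(NA_real_, dim = c(q1, q2, dim(z)[3]))
  for (i1 in seq_len(q1)) {
    for (i2 in seq_len(q2)) {
      rows <- (f1 * (i1 - 1) + 1):(f1 * i1)   # spatial window, first component
      cols <- (f2 * (i2 - 1) + 1):(f2 * i2)   # spatial window, second component
      # maximum over the window, taken for each channel individually
      out[i1, i2, ] <- apply(z[rows, cols, , drop = FALSE], 3, max)
    }
  }
  out
}

set.seed(1)
z <- array(runif(180 * 50 * 3), dim = c(180, 50, 3))   # simulated RGB image
pooled <- max_pool(z, f1 = 10, f2 = 5)
dim(pooled)   # 18 10 3
```

Replacing `max` by `mean` in the inner call would give the corresponding average-pooling sketch.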


9.2.5 Flatten Layer 


A flatten layer performs the transformation of rearranging a tensor to a vector, so that 
the output of a flatten layer can be used as an input to a FN layer. That is, 


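$$z^{(m)}: \mathbb{R}^{q_1^{(m-1)} \times \cdots \times q_K^{(m-1)}} \to \mathbb{R}^{q_m}, \qquad z \mapsto z^{(m)}(z) = \left(z_{1,\ldots,1}, \ldots, z_{q_1^{(m-1)},\ldots,q_K^{(m-1)}}\right)^\top, \qquad (9.12)$$

with $q_m = \prod_{k=1}^{K} q_k^{(m-1)}$. We have already used flatten layers after embedding layers
on lines 8 and 11 of Listing 7.4.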
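As a small illustration of the flattening operation, the following base R sketch rearranges a 4 x 3 x 5 array into a vector of length 60. The ordering of the entries is an assumption that depends on the software: base R vectorizes in column-major order, whereas, to our understanding, the keras/TensorFlow flatten layer uses row-major order, which can be mimicked by permuting the axes first.

```r
# Flattening a tensor of order 3 into a vector (illustrative sketch).
z <- array(1:60, dim = c(4, 3, 5))

flat <- as.vector(z)            # column-major: first index runs fastest
length(flat)                    # 60 = 4 * 3 * 5

# Row-major ordering (last index runs fastest), assumed to correspond to
# the ordering used by a keras flatten layer.
flat_row_major <- as.vector(aperm(z, c(3, 2, 1)))
```

The dimension of the output only depends on the product of the input dimensions, not on the chosen ordering.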


9.3 Convolutional Neural Network Architectures 


9.3.1 Illustrative Example of a CN Network Architecture 


We are now ready to patch everything together. Assume we have RGB images
described by tensors $x^{(0)} \in \mathbb{R}^{I \times J \times 3}$ of order 3 modeling the three RGB channels
of images of a fixed size $I \times J$. Moreover, we have the tabular feature information
$x^{(1)} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ that describes further properties of the data. That is, we have an
input variable $(x^{(0)}, x^{(1)})$, and we aim at predicting a response variable $Y$ by using
a suitable regression function

$$(x^{(0)}, x^{(1)}) \mapsto \mu(x^{(0)}, x^{(1)}) = \mathbb{E}\left[Y \mid x^{(0)}, x^{(1)}\right]. \qquad (9.13)$$

We choose two convolutional layers $z^{(\mathrm{CN1})}$ and $z^{(\mathrm{CN2})}$, each followed by a max-
pooling layer $z^{(\mathrm{Max1})}$ and $z^{(\mathrm{Max2})}$, respectively. Then we apply a flatten layer $z^{(\mathrm{flatten})}$
to bring the learned representation into a vector form. These layers are chosen
according to (9.7), (9.10) and (9.12) with matching input and output dimensions
so that the following composition is well-defined

$$z^{(5:1)} = z^{(\mathrm{flatten})} \circ z^{(\mathrm{Max2})} \circ z^{(\mathrm{CN2})} \circ z^{(\mathrm{Max1})} \circ z^{(\mathrm{CN1})} : \mathbb{R}^{I \times J \times 3} \to \mathbb{R}^{q_5}.$$

Listing 9.1 provides an example starting from an $I \times J \times 3 = 180 \times 50 \times 3$ input tensor
$x^{(0)}$ and receiving a $q_5 = 60$ dimensional learned representation $z^{(5:1)}(x^{(0)}) \in \mathbb{R}^{60}$.


Listing 9.1 CN network architecture in keras

library(keras)

shape <- c(180,50,3)
#
model = keras_model_sequential()
model %>%
  layer_conv_2d(filters = 10, kernel_size = c(11,6), activation='tanh',
                input_shape = shape) %>%
  layer_max_pooling_2d(pool_size = c(10,5)) %>%
  layer_conv_2d(filters = 5, kernel_size = c(6,4), activation='tanh') %>%
  layer_max_pooling_2d(pool_size = c(3,2)) %>%
  layer_flatten()


Listing 9.2 Summary of CN network architecture 


Layer (type)                      Output Shape            Param #
conv2d_1 (Conv2D)                 (None, 170, 45, 10)     1990
max_pooling2d_1 (MaxPooling2D)    (None, 17, 9, 10)       0
conv2d_2 (Conv2D)                 (None, 12, 6, 5)        1205
max_pooling2d_2 (MaxPooling2D)    (None, 4, 3, 5)         0
flatten_1 (Flatten)               (None, 60)              0

Total params: 3,195
Trainable params: 3,195
Non-trainable params: 0


Listing 9.2 gives the summary of this architecture providing the dimension reduction 
mappings (encodings) 


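$$180 \times 50 \times 3 \ \overset{\mathrm{CN1}}{\mapsto}\ 170 \times 45 \times 10 \ \overset{\mathrm{Max1}}{\mapsto}\ 17 \times 9 \times 10 \ \overset{\mathrm{CN2}}{\mapsto}\ 12 \times 6 \times 5 \ \overset{\mathrm{Max2}}{\mapsto}\ 4 \times 3 \times 5 \ \overset{\mathrm{flatten}}{\mapsto}\ 60.$$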


The first CN layer (m = 1) involves $10 \cdot (1 + 11 \cdot 6 \cdot 3) = 1'990$ filter weights
(including the intercepts), and the second CN layer (m = 3)
involves $5 \cdot (1 + 6 \cdot 4 \cdot 10) = 1'205$ filter weights (including the intercepts). Altogether
we have a network parameter of dimension 3'195 to be fitted in this CN network
architecture.
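The dimensions and parameter counts of Listing 9.2 can be verified by hand. The following base R sketch does this arithmetic; the helper names `conv_out()` and `conv_params()` are hypothetical, and the sketch assumes no padding, stride 1 in the CN layers, and the floor convention in the max-pooling layers.

```r
# Output dimension of a CN layer without padding and with stride 1.
conv_out <- function(q, f) q - f + 1
# Number of filter weights including one intercept per filter.
conv_params <- function(f, channels_in, filters) filters * (1 + prod(f) * channels_in)

d0 <- c(180, 50)                  # spatial input dimensions
d1 <- conv_out(d0, c(11, 6))      # first CN layer:   170 x 45
d2 <- floor(d1 / c(10, 5))        # first max-pooling: 17 x 9
d3 <- conv_out(d2, c(6, 4))       # second CN layer:   12 x 6
d4 <- floor(d3 / c(3, 2))         # second max-pooling: 4 x 3
q5 <- prod(d4) * 5                # flatten with 5 filters: 60

p1 <- conv_params(c(11, 6), channels_in = 3,  filters = 10)   # 1990
p2 <- conv_params(c(6, 4),  channels_in = 10, filters = 5)    # 1205
p1 + p2                                                        # 3195
```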

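To perform the prediction task (9.13) we concatenate the learned representation
$z^{(5:1)}(x^{(0)}) \in \mathbb{R}^{q_5}$ of the RGB image $x^{(0)}$ with the tabular feature $x^{(1)} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$.
This concatenated vector is processed through a FN network architecture $z^{(d+5:6)}$ of
depth $d \ge 1$ providing the output

$$(x^{(0)}, x^{(1)}) \mapsto \mathbb{E}\left[Y \mid x^{(0)}, x^{(1)}\right] = g^{-1}\left\langle \beta,\ z^{(d+5:6)}\left(z^{(5:1)}(x^{(0)}),\ (x^{(1)}_j)_{1 \le j \le q}\right)\right\rangle,$$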


for given link function g. This last step can be done in complete analogy to Chap. 7, 
and fitting of such a network architecture uses variants of the SGD algorithm. 
9.3.2 Lab: Telematics Data


We present a CN network example that studies time-series of telematics car driving 
data. Unfortunately, this data is not publicly available. Recently, telematics car 
driving data has gained much popularity in actuarial science, because this data 
provides information of car drivers that goes beyond the classical features (age of 
driver, year of driving test, etc.), and it provides a better discrimination of good and 
bad drivers as it is directly based on the driving habits and the driving styles. 

The telematics data has many different aspects. Raw telematics data typically 
consists of high-frequency GPS location data, say, second by second, from which 
several different statistics such as speed, acceleration and change of direction can 
be calculated. Besides the GPS location data, it often contains vehicle speeds 
from the vehicle instrumental panel, and acceleration in all directions from an 
accelerometer. Thus, there are often 3 different sources from which the speed and
the acceleration can be extracted. In practice, the data quality is often an issue as
these 3 different sources may give substantially different numbers; Meng et al. [271]
give a broader discussion on these data quality issues. The telematics GPS data
is often complemented by further information such as engine revolutions, daytime 
of trips, road and traffic conditions, weather conditions, traffic rule violations, etc. 
This raw telematics data is then pre-processed, e.g., special maneuvers are extracted 
(speeding, sudden acceleration, hard braking, extreme right- and left-turns), total 
distances are calculated, driving distances at different daytimes and weekdays are 
analyzed. For references analyzing such statistics for predictive modeling we refer to 
Ayuso et al. [17-19], Boucher et al. [42], Huang–Meng [193], Lemaire et al. [246],
Paefgen et al. [291], So et al. [344], Sun et al. [347] and Verbelen et al. [370]. A
different approach has been taken by Wüthrich [388] and Gao et al. [151, 154, 155],
namely, these authors aggregate the telematics data of speed and acceleration to
so-called speed-acceleration v-a heatmaps. These v-a heatmaps are understood as
images which can be analyzed, e.g., by CN networks; such an analysis has been
performed in Zhu–Wüthrich [407] for image classification and in Gao et al. [154]
for claim frequency modeling. Finally, the work of Weidner et al. [377, 378] directly
acts on the time-series of the telematics GPS data by performing a Fourier analysis.

In this section, we aim at allocating individual car driving trips to the right drivers
by directly analyzing the time-series of the telematics data of these trips using CN
networks. We therefore replicate the analysis of Gao–Wüthrich [156] on slightly
different data. For our illustrative example we select 3 car drivers and we call them
driver A, driver B and driver C. For each of these 3 drivers we choose individual
car driving trips of 180 seconds, and we analyze their speed-acceleration-change in
angle (v-a-Δ) pattern every second. Thus, for $t = 1, \ldots, T = 180$, we study the three
input channels

$$x_{s,t} = (v_{s,t}, a_{s,t}, \Delta_{s,t})^\top \in [2, 50]\,\mathrm{km/h} \times [-3, 3]\,\mathrm{m/s^2} \times [0, 1/2] \subset \mathbb{R}^3,$$
where $1 \le s \le S$ labels all individual trips of the considered drivers. This data has
been pre-processed by cutting out the idling phase and the speeds above 50 km/h
and concatenating the remaining pieces. We perform this pre-processing since
we do not want to identify the drivers because they have a special idling phase
picture or because they are more likely on the highway. Acceleration has been
censored at $\pm 3\,\mathrm{m/s^2}$ because we cannot exclude that more extreme observations are
caused by data quality issues (note that the acceleration is calculated from the GPS
coordinates and if the signals are not fully precise it can lead to extreme acceleration
observations). Finally, change in angle is measured in absolute values of sine per
second (censored at 1/2), i.e., we do not distinguish between left and right turns.
This then provides us with three time-series channels giving tensors of order 2

$$x_s = \left((v_{s,1}, a_{s,1}, \Delta_{s,1})^\top, \ldots, (v_{s,180}, a_{s,180}, \Delta_{s,180})^\top\right)^\top \in \mathbb{R}^{180 \times 3},$$

for $1 \le s \le S$. Moreover, there is a categorical response $Y_s \in \{A, B, C\}$ indicating
which driver has been driving trip $s$.

Figure 9.1 illustrates the first three trips $x_s$ of $T = 180$ seconds of each of these three
drivers A (top), B (middle) and C (bottom); note that the 180 seconds have been
chosen at a random location within each trip. The first lines in red color show the
acceleration patterns $(a_t)_{1 \le t \le T}$, the second lines in black color the change in angle
patterns $(\Delta_t)_{1 \le t \le T}$, and the last lines in blue color the speed patterns $(v_t)_{1 \le t \le T}$.
Table 9.1 summarizes the available data. In total we have 932 individual trips, and
we randomly split these trips into the learning data $\mathcal{L}$ consisting of 744 trips and the
test data $\mathcal{T}$ collecting the remaining trips. The goal is to train a classification model
that correctly allocates the test data $\mathcal{T}$ to the right driver. As feature information, we
use the telematics data $x_s$ of length 180 seconds. We design a logistic categorical
regression model with response set $\mathcal{Y} = \{A, B, C\}$. Hence, we obtain a vector-valued
parameter EF with a response having 3 levels, see Sect. 2.1.4.

To process the telematics data $x_s$, we design a CN network architecture having
three convolutional layers $z^{(\mathrm{CN}j)}$, $1 \le j \le 3$, each followed by a max-pooling
layer $z^{(\mathrm{Max}j)}$, then we apply a drop-out layer $z^{(\mathrm{DO})}$ and finally a fully-connected FN
layer $z^{(\mathrm{FN})}$ providing the logistic response classification; this is the same network
architecture as used in Gao–Wüthrich [156]. The code is given in Listing 9.3 and it
describes the mapping

$$z^{(8:1)} = z^{(\mathrm{FN})} \circ z^{(\mathrm{DO})} \circ z^{(\mathrm{Max3})} \circ z^{(\mathrm{CN3})} \circ z^{(\mathrm{Max2})} \circ z^{(\mathrm{CN2})} \circ z^{(\mathrm{Max1})} \circ z^{(\mathrm{CN1})} : \mathbb{R}^{T \times 3} \to (0,1)^3.$$

The first CN and pooling layer $z^{(\mathrm{Max1})} \circ z^{(\mathrm{CN1})}$ maps the dimension $180 \times 3$ to a
tensor of dimension $58 \times 12$ using 12 filters; the max-pooling uses the floor (9.11).
The second CN and pooling layer $z^{(\mathrm{Max2})} \circ z^{(\mathrm{CN2})}$ maps to $18 \times 10$ using 10 filters,
and the third CN and pooling layer $z^{(\mathrm{Max3})} \circ z^{(\mathrm{CN3})}$ maps to $1 \times 8$ using 8 filters.
Actually, this last max-pooling layer is a global max-pooling layer extracting the
maximum in each of the 8 filters. Next, we apply a drop-out layer with a drop-out
Fig. 9.1 First 3 trips of driver A (top), driver B (middle) and driver C (bottom); each trip is 180
seconds, red color shows the acceleration pattern $(a_t)_t$, black color the change in angle pattern
$(\Delta_t)_t$, and blue color the speed pattern $(v_t)_t$


Table 9.1 Summary of the trips and the choice of learning and test data sets $\mathcal{L}$ and $\mathcal{T}$


Driver                                     ...      Total
Number of trips S                          286      932
Learning data $\mathcal{L}$                228      744
Test data $\mathcal{T}$                             188
Average speed $v_t$                        30.2 km/h
Average acceleration/braking $|a_t|$       0.74 m/s²
Average change in angle $\Delta_t$         0.076 |sin|/s


rate of 30% to prevent over-fitting. Finally, we apply a fully-connected FN
layer that maps the 8 neurons to the 3 categorical outputs using the softmax output
activation function, which provides the canonical link of the logistic categorical EF.


Listing 9.3 CN network architecture for the individual car trip allocation

library(keras)

shape <- c(180,3)
#
model = keras_model_sequential()
model %>%
  layer_conv_1d(filters = 12, kernel_size = 5, activation='tanh',
                input_shape = shape) %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 10, kernel_size = 5, activation='tanh') %>%
  layer_max_pooling_1d(pool_size = 3) %>%
  layer_conv_1d(filters = 8, kernel_size = 5, activation='tanh') %>%
  layer_global_max_pooling_1d() %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 3, activation = 'softmax')


For a summary of the network architecture see Listing 9.4. Altogether this involves 
1’237 network parameters that need to be fitted. 
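The count of 1'237 network parameters can be reproduced layer by layer. The following base R sketch does this under the assumptions of no padding and stride 1 in the CN layers and the floor convention in the max-pooling layers; all variable names are illustrative only.

```r
# Lengths of the time-series after each layer (input length T = 180).
l1  <- 180 - 5 + 1       # first CN layer with kernel size 5: 176
l1p <- floor(l1 / 3)     # first max-pooling with pool size 3: 58
l2  <- l1p - 5 + 1       # second CN layer: 54
l2p <- floor(l2 / 3)     # second max-pooling: 18
l3  <- l2p - 5 + 1       # third CN layer: 14 (before global max-pooling)

# Parameters: filters * (intercept + kernel size * input channels).
params <- c(
  conv1 = 12 * (1 + 5 * 3),    # 192
  conv2 = 10 * (1 + 5 * 12),   # 610
  conv3 =  8 * (1 + 5 * 10),   # 408
  dense =  3 * (8 + 1)         # 27, softmax output on the 8 pooled neurons
)
sum(params)                    # 1237
```

Max-pooling, global max-pooling and drop-out layers do not involve any parameters.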


Listing 9.4 Summary of CN network architecture for the individual car trip allocation 


Layer (type)                           Output Shape       Param #
conv1d_1 (Conv1D)                      (None, 176, 12)    192
max_pooling1d_1 (MaxPooling1D)         (None, 58, 12)     0
conv1d_2 (Conv1D)                      (None, 54, 10)     610
max_pooling1d_2 (MaxPooling1D)         (None, 18, 10)     0
conv1d_3 (Conv1D)                      (None, 14, 8)      408
global_max_pooling1d_1 (GlobalMaxPool  (None, 8)          0
dropout_1 (Dropout)                    (None, 8)          0
dense_1 (Dense)                        (None, 3)          27

Total params: 1,237
Trainable params: 1,237
Non-trainable params: 0


We choose the 744 trips of the learning data $\mathcal{L}$ to train this network to the
classification task, see Table 9.1. We use the multi-class cross-entropy loss function,
see (4.19), with 80% of the learning data $\mathcal{L}$ as training data $\mathcal{U}$ and the remaining
20% as validation data $\mathcal{V}$ to track over-fitting. We retrieve the network with the
smallest validation loss using a callback; we refer to Listing 7.3 for a callback.
Since the learning data is comparably small and to reduce randomness, we use the
nagging predictor averaging over 10 different network fits (using different seeds).
Table 9.2 Out-of-sample confusion matrix


                            True labels
                            Driver A    Driver B    Driver C
Predicted label A           39          10          2
Predicted label B           9           66          6
Predicted label C           4           2           50
% correctly allocated       75.0%       84.6%       86.2%
# of trips in test data     52          78          58


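These fitted networks then provide us with a mapping

$$\widehat{z}^{(8:1)}: \mathbb{R}^{T \times 3} \to (0,1)^3, \qquad x \mapsto \widehat{z}^{(8:1)}(x) = \left(\widehat{z}^{(8:1)}_A(x), \widehat{z}^{(8:1)}_B(x), \widehat{z}^{(8:1)}_C(x)\right)^\top,$$

and for each trip $x_s \in \mathbb{R}^{T \times 3}$ we receive the classification

$$\widehat{Y}_s = \underset{y \in \{A, B, C\}}{\arg\max}\ \widehat{z}^{(8:1)}_y(x_s).$$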


Table 9.2 shows the out-of-sample results on the test data $\mathcal{T}$. On average more than
80% of all trips are correctly allocated; a purely random allocation would provide
a success rate of 33%. This shows that this allocation problem can be solved rather
successfully and, indeed, the CN network architecture is able to learn structure in
the telematics trip data $x_s$ that allows one to discriminate car drivers. This sounds
very promising. In fact, the telematics car driving data seems to be very transparent
which, of course, also raises privacy issues. On the downside we should mention
that from this approach we cannot really see what the network has learned and how
it manages to distinguish the different trips.
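The success rates of Table 9.2 follow directly from the confusion matrix. A short base R sketch of this calculation is given next; the object names are illustrative only.

```r
# Confusion matrix of Table 9.2: rows are predicted labels, columns true labels.
conf <- matrix(c(39, 10,  2,
                  9, 66,  6,
                  4,  2, 50),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = c("A", "B", "C"),
                               true      = c("A", "B", "C")))

trips      <- colSums(conf)                 # trips per driver in the test data
per_driver <- diag(conf) / trips            # share of correctly allocated trips
overall    <- sum(diag(conf)) / sum(conf)   # overall share on all 188 test trips

round(100 * per_driver, 1)   # A: 75.0, B: 84.6, C: 86.2
round(100 * overall, 1)      # 82.4
```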

There are several approaches that try to visualize what the network has learned
in the different layers by extracting the filter activations in the CN layers. Others
try to invert the networks, backtracking which activations and weights contribute
most to a certain output; we mention, e.g., DeepLIFT of Shrikumar et al. [339].
For more analysis and references we refer to Sect. 4 of the tutorial Meier–Wüthrich
[269]. We do not further discuss this and close this example.


9.3.3 Lab: Mortality Surface Modeling


We revisit the mortality example of Sect. 8.4.2 where we used a LSTM architecture 
to process the raw mortality data for forecasting, see Fig. 8.13. We are going to do 
a (small) change to that architecture by simply replacing the LSTM encoder by a 
CN network encoder. This approach has been promoted in the literature, e.g., by 
Perla et al. [301], Schnürch-Korn [330] and Wang et al. [375]. A main difference 
between these references is whether the mortality tensor is considered as a tensor
of order 2 (reflecting time-series data) or of order 3 (reflecting the mortality surface
as an image). In the present example we are going to interpret the mortality tensor
as a monochrome image, and this requires that we extend (8.23) by an additional
channels component

$$x_{t-\tau:t-1} = (x_{t-\tau}, \ldots, x_{t-1})^\top = (M_{x,s})_{t-\tau \le s \le t-1,\ x_0 \le x \le x_1} \in \mathbb{R}^{\tau \times (x_1 - x_0 + 1) \times 1} = \mathbb{R}^{5 \times 100 \times 1},$$

for a lookback period of $\tau = 5$. The LSTM cell encodes this tensor/matrix into a
20-dimensional vector which is then concatenated with the embeddings of the country
code and the gender code (8.24). We use the same architecture here, only the LSTM
part is replaced by a CN network in (8.25); the corresponding code is given on lines
14-17 of Listing 9.5.


Listing 9.5 CN network architecture to directly process the raw mortality rates $(M_{x,t})_{x,t}$

Tensor  = layer_input(shape=c(lookback,100,1), dtype='float32', name='Tensor')
Country = layer_input(shape=c(1), dtype='int32', name='Country')
Gender  = layer_input(shape=c(1), dtype='int32', name='Gender')
Time    = layer_input(shape=c(1), dtype='float32', name='Time')
#
CountryEmb = Country %>%
  layer_embedding(input_dim=8,output_dim=1,input_length=1,name='CountryEmb') %>%
  layer_flatten(name='Country_flat')
#
GenderEmb = Gender %>%
  layer_embedding(input_dim=2,output_dim=1,input_length=1,name='GenderEmb') %>%
  layer_flatten(name='Gender_flat')
#
CN = Tensor %>%
  layer_conv_2d(filters = 10, kernel_size = c(5,5), activation = 'linear') %>%
  layer_max_pooling_2d(pool_size = c(1,8)) %>%
  layer_flatten()
#
Output = list(CN,CountryEmb,GenderEmb) %>% layer_concatenate() %>%
  layer_dense(units=100, activation='linear', name='scalarproduct') %>%
  layer_reshape(c(1,100), name = 'Output')
#
model = keras_model(inputs = list(Tensor, Country, Gender),
                    outputs = c(Output))


Line 15 maps the input tensor 5 x 100 x 1 to a tensor 1 x 96 x 10 having 10 filters, the 
max-pooling layer reduces this tensor to 1 x 12 x 10, and the flatten layer encodes 
this tensor into a 120-dimensional vector. This vector is then concatenated with the 
embedding vectors of the country and the gender codes, and this provides us with 
r = 12'570 network parameters, thus, the LSTM architecture and the CN network 
architecture use roughly equally many network parameters that need to be fitted. We 
then use the identical partition in training, validation and test data as in Sect. 8.4.2, 
i.e., we use the data from 1950 to 2003 for fitting the network architecture, which is 
then used to forecast the calendar years 2004 to 2018. The results are presented in 
Table 9.3. 
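The stated number of 12'570 network parameters can be checked with a few lines of base R. This is a sketch of the arithmetic only, assuming no padding and stride 1 in the CN layer, one intercept per filter and per output neuron, and one-dimensional embeddings for the 8 countries and 2 genders as in Listing 9.5; the variable names are illustrative.

```r
# CN layer: 10 filters of size 5 x 5 on 1 channel, each with an intercept.
conv_params <- 10 * (1 + 5 * 5 * 1)            # 260

# Dimensions: 5 x 100 x 1 -> 1 x 96 x 10 -> 1 x 12 x 10 -> 120
conv_dim <- c(5 - 5 + 1, 100 - 5 + 1, 10)      # 1 96 10
pool_dim <- c(conv_dim[1], floor(conv_dim[2] / 8), conv_dim[3])   # 1 12 10
flat_dim <- prod(pool_dim)                     # 120

# Embedding weights: 8 countries and 2 genders, embedding dimension 1 each.
emb_params <- 8 * 1 + 2 * 1                    # 10

# Output layer: 100 neurons on the 120 + 1 + 1 concatenated components.
dense_params <- 100 * (flat_dim + 2 + 1)       # 12300

conv_params + emb_params + dense_params        # 12570
```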
Table 9.3 Comparison of the out-of-sample mean squared losses for the calendar years
$2004 \le t \le 2018$; the figures are in $10^{-4}$


                        Female                        Male
                        LC       LSTM     CN          LC       LSTM     CN
Austria AUT             0.765    0.312    0.635       2.527    1.169    1.569
Belgium BE              0.371    0.311    0.290       2.835    0.960    1.100
Switzerland CH          0.654    0.478    0.772       1.609    1.134    2.035
Spain ESP               1.446    0.514    0.199       1.742    0.245    0.240
France FRA              0.175    1.684    0.309       0.333    0.363    0.770
Italy ITA               0.179    0.330    0.186       0.874    0.320    0.421
The Netherlands NL      0.426    0.315    0.266       1.978    0.601    0.606
Portugal POR            2.097    0.464    0.416       1.848    1.239    1.880


We observe that in our case the CN network architecture provides good results for 
the female populations, whereas for the male populations we rather prefer the LSTM 
architecture. At the current stage we rather see this as a proof of concept, because 
we have not really fine-tuned the network architectures, nor has the SGD fitting 
been perfected, e.g., often bigger architectures are used in combination with
drop-outs, etc. We refrain from doing so, here, but refer to the relevant literature Perla
et al. [301], Schnürch–Korn [330] and Wang et al. [375] for a more sophisticated
fine-tuning.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 10
Natural Language Processing


Natural language processing (NLP) is a rapidly growing field that studies language,
communication and text recognition. The purpose of this chapter is to present
an introduction to NLP. Important milestones in the field of NLP are the work of
Bengio et al. [28, 29] who have introduced the idea of word embedding, the work
of Mikolov et al. [275, 276] who have developed word2vec which is an efficient
word embedding tool, and the work of Pennington et al. [300] and Chaubard et
al. [68] who provide the pre-trained word embedding model GloVe¹ and detailed
educational material.² An excellent overview of the NLP working pipeline is
provided by the tutorial of Ferrario–Nägelin [126]. This overview distinguishes
three approaches: (1) the classical approach using bag-of-words and bag-of-part-
of-speech models to classify text documents; (2) the modern approach using word
embeddings to receive a low-dimensional representation of the dictionary, which
is then further processed; (3) the contemporary approach that uses a minimal amount
of text pre-processing and directly feeds raw data to a machine learning algorithm.
We discuss these different approaches and show how they can be used to extract
the relevant information from claim descriptions to predict the claim types and the
claim sizes; in the actuarial literature first papers on this topic have been published
by Lee et al. [236] and Manski et al. [264].


10.1 Feature Pre-processing and Bag-of-Words


NLP requires an extensive feature pre-processing and engineering as different texts
can be rather diverse in language, grammar, abbreviations, typos, etc. The current
developments aim at automating this process, nevertheless, many of these steps
are still (tedious) manual work. Our goal here is to present the whole working
pipeline to process language, perform text recognition and text understanding. As
an example we use the claim data described in Chap. 13.3; this data has been made
available through the book project of Frees [135], and it comprises property claims
of governmental institutions in Wisconsin, US. An excerpt of the data is given in
Listing 10.1; our attention applies to line 11 which provides a (very) short claim
description for every claim.

¹ https://nlp.stanford.edu/projects/glove/.
² https://nlp.stanford.edu/teaching/.


Listing 10.1 Excerpt of the Wisconsin Local Government Property Insurance Fund (LGPIF) data 
set with short claim descriptions on line 11 


'data.frame': 5424 obs. of 10 variables:
$ PolicyNum   : int 120002 120003 120003 120003 120003 120003 120003
$ Year        : int 2010 2007 2008 2007 2009 2010 2007 2007 2009 2007
$ Claim       : num 6839 2085 8775 600 34610
$ Deduct      : int 1000 5000 5000 5000 5000 5000 5000 5000 5000 5000
$ EntityType  : Factor w/ 6 levels "City","County",..: 2 2 2 2 2 2 2 2 2 2
$ CoverageCode: Factor w/ 13 levels "CE","CF","CS",..: 12 12 11 11 11 12
$ Fire5       : int 4 0 0 0 0 0 0 0 0 0
$ CountyCode  : Factor w/ 72 levels "ADA","ASH","BAR",..: 2 3 3 3 3 3 3 3 ...
$ Hazard      : Factor w/ 9 levels "Fire","Hail",..: 3 3 5 5 9 6 3 3 3 3
$ Description : chr "lightning damage" "lightning damage at Comm. Center" ...


In a first step we need to pre-process the texts to make them suitable for predictive 
modeling. This first step is called tokenization. Essentially, tokenization labels the 
words with integers, that is, the used vocabulary is encoded by integers. There are 
several issues that one has to deal with in this first step such as upper and lower 
case, punctuation, orthographic errors and differences, abbreviations, etc. Different 
treatments of these issues will lead to different results, for more on this topic we 
refer to Sect. 1 in Ferrario–Nägelin [126]. We simply use the standard routine
offered in R keras [77] called text_tokenizer() with its standard settings.


Listing 10.2 Tokenization within R keras [77] 
library(keras)

## initialize tokenizer and fit
tokenizer <- text_tokenizer() %>% fit_text_tokenizer(dat$Description)

## number of tokens/words
length(tokenizer$word_index)

## frequency of word appearances in each text
freq.text <- texts_to_matrix(tokenizer, dat$Description, mode = "count")


The R code in Listing 10.2 shows the crucial steps in tokenization. Line 4 extracts
the relevant vocabulary from all available claim descriptions. In total the 5424 claim
descriptions of Listing 10.1 use W = 2'237 different words. This double counts
different spellings, e.g., ‘color’ vs. ‘colour’.

Figure 10.1 shows the most frequently used words in the claim descriptions of 
Listing 10.1. These are (in this order): ‘at’, ‘damage’, ‘damaged’, ‘vandalism’, 
‘lightning’, ‘to’, ‘water’, ‘glass’, ‘park’, ‘fire’, ‘hs’, ‘wind’, ‘light’, ‘door’, ‘es’, 
‘and’, ‘of’, ‘vehicle’, ‘pole’ and ‘power’. We observe that many of these words 
are directly related to insurance claims, such as ‘damage’ and ‘vandalism’, others 
are frequent stopwords like ‘at’ and ‘to’, and then there are abbreviations like ‘hs’ 
and ‘es’ standing for high school and elementary school. 


Listing 10.3 Word and text encoding 


maxlen <- max(rowSums(freq.text))

## encode the sentences
text.seq <- texts_to_sequences(tokenizer, dat$Description)

## pad the sentences
text.seq.pad <- pad_sequences(text.seq, maxlen = maxlen, padding = "post")

## examples
lightning/hail damage to equip at airport
5 48 2 6 196 1 40 0 0 0 0
#
garage door damaged
36 14 3 0 0 0 0 0 0 0 0


The next step is to assign the (integer) labels $1 \le w \le W$ from the tokenization
to the words in the texts. The maximal length over all texts/sentences is T = 11
words. This step and padding the sentences with zeros to equal length T is presented
on lines 1-7 of Listing 10.3. Lines 11 and 14 of this listing give two explicit text
examples

$$\text{text} = (w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T,$$

where we set for the vocabulary used

$$\mathcal{W} = \{1, \ldots, W\} \subset \mathbb{N} \qquad \text{and} \qquad \mathcal{W}_0 = \mathcal{W} \cup \{0\}.$$

The label 0 is used for padding shorter texts to the common length T = 11. The
method of bag-of-words embeds text $= (w_1, \ldots, w_T)^\top$ into $\mathbb{N}_0^W$

$$\psi: \mathcal{W}_0^T \to \mathbb{N}_0^W, \qquad \text{text} \mapsto \psi(\text{text}) = \left(\sum_{t=1}^T \mathbb{1}_{\{w_t = w\}}\right)_{w \in \mathcal{W}}. \qquad (10.1)$$

The bag-of-words $\psi(\text{text})$ counts how often each word $w \in \mathcal{W}$ appears in a given
text $= (w_1, \ldots, w_T)^\top$; the corresponding code is given on line 10 of Listing 10.2.
The bag-of-words mapping $\psi$ is not injective as the order of occurrence of the
words gets lost, and, thus, also the semantics of the sentence gets lost. E.g., the
bag-of-words of the following two sentences is the same: ‘The claim is expensive.’
and ‘Is the claim expensive?’. This is the reason for calling it a “bag of words”
(which is unordered). This bag-of-words encoding resembles one-hot encoding,
namely, if every text consists of a single word, T = 1, then we receive the one-hot
encoding with W describing the number of different levels, see (7.28). The bag-of-
words $\psi(\text{text}) \in \mathbb{N}_0^W$ can directly be used as an input to a regression model. The
disadvantage of this approach is that the input typically is high-dimensional (and
likely sparse), and it is recommended that only the frequent words are considered.
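The loss of word order in the bag-of-words mapping can be illustrated in base R without keras. The tokenization below is a deliberately crude stand-in (lower case, punctuation removed, split at white space) and the helper `tokenize()` is hypothetical; it is not the tokenizer of Listing 10.2.

```r
texts <- c("The claim is expensive.", "Is the claim expensive?")

# Crude illustrative tokenization: lower case, strip punctuation, split at spaces.
tokenize <- function(s) strsplit(gsub("[[:punct:]]", "", tolower(s)), "\\s+")[[1]]
tokens <- lapply(texts, tokenize)

# Vocabulary in order of first appearance.
vocab <- unique(unlist(tokens))

# Bag-of-words: count how often each word of the vocabulary appears in each text.
bow <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
bow

# Both sentences are mapped to the same count vector.
all(bow[1, ] == bow[2, ])
```

The two rows of `bow` coincide although the sentences differ in meaning, which is exactly the non-injectivity discussed above.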


Listing 10.4 Removal of stopwords and lemmatization 


library(textstem)
library(tm)

text.clean <- removeWords(dat$Description, stopwords("english"))
text.clean <- lemmatize_strings(text.clean, dictionary = lexicon::hash_lemmas)


Additionally, stopwords can be removed. We perform this removal below because 
frequent stopwords like ‘and’ or ‘to’ may not essentially contribute to the under- 
standing of the (short) claim descriptions; the code for the stopword removal is 
provided on line 4 of Listing 10.4. Moreover, stemming can be performed which 
means that inflectional forms are reduced to their stem by just truncating pre- and 
suffixes, conjugations, declensions, etc. Lemmatization is a more sophisticated form 
of reducing inflectional forms by using vocabularies and morphological analyses; 
an example is provided on line 5 of Listing 10.4. If we perform these two steps 
of removing stopwords and lemmatization to our example, the number of different 
words is reduced from 2’237 to 1’982.
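To see the effect of stopword removal on a single claim description, the following base R sketch uses a small hand-picked stopword list. Both the list `stopwords_demo` and the helper `remove_stopwords()` are hypothetical and only serve as an illustration; they do not reproduce the tm and textstem routines of Listing 10.4, and no lemmatization is performed.

```r
# Small illustrative stopword list (an assumption, not the tm list).
stopwords_demo <- c("at", "to", "and", "of", "the")

# Remove stopwords from a single text after lower-casing and splitting at spaces.
remove_stopwords <- function(s, stopwords) {
  tok <- strsplit(tolower(s), "\\s+")[[1]]
  paste(tok[!tok %in% stopwords], collapse = " ")
}

cleaned <- remove_stopwords("lightning damage at Comm. Center", stopwords_demo)
cleaned   # "lightning damage comm. center"
```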

Another step that can be performed is tagging words with part-of-speech (POS) 
attributes. These POS attributes indicate whether the corresponding words are used 
as nouns, adjectives, adverbs, etc., in the corresponding sentences. We then call the
resulting encoding bag-of-POS. We refrain from doing this because we will present 
more sophisticated methods in the next sections. 


10.2 Word Embeddings 


The bag-of-words (10.1) can be interpreted as representing each word $w \in \mathcal{W} =
\{1, \ldots, W\}$ by a one-hot encoding in $\{0, 1\}^W$, and then aggregating these one-hot
encodings over all words that appear in the given text $= (w_1, \ldots, w_T)^\top$. Bengio
et al. [28, 29] have introduced the technique of word embedding that maps words
to a lower dimensional Euclidean space $\mathbb{R}^b$, $b \ll W$, such that proximity in $\mathbb{R}^b$
is associated with similarity in the meaning of the words, e.g., ‘rain’, ‘water’ and
‘flood’ should be closer to each other in $\mathbb{R}^b$ than to ‘vandalism’ (in an insurance
context). This is exactly the idea promoted in the embedding mapping (7.31) using
the embedding layers. Thus, we are looking for an embedding mapping

$$e: \mathcal{W} \to \mathbb{R}^b, \qquad w \mapsto e(w), \qquad (10.2)$$

that maps each word w (or rather its tokenization) to a b-dimensional vector e(w),
for a given embedding dimension $b \ll W$. The general idea now is that similarity in
the meaning of words can be learned from the context in which the words are used.
That is, when we consider a text

$$\text{text} = (w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T)^\top,$$

then it might be possible to infer $w_t$ from its neighbors $w_{t-j}$ and $w_{t+j}$, $j \ge 1$. This
explains the context of a word $w_t$, and using suitable learning tools it should also be
possible to learn synonyms for $w_t$ as these synonyms will stand in similar contexts.

More mathematically speaking, we assume that there exists a probability distribution
p over the set of all texts of length T (using padding with zeros to common
length)

$$\mathcal{T} = \left\{\text{text} = (w_1, \ldots, w_T)^\top\right\} \subset \mathcal{W}_0^T,$$

such that a randomly chosen text $\in \mathcal{T}$ appears with probability $p(w_1, \ldots, w_T) \in
[0, 1)$. Inference of a word $w_t$ from its context can then be obtained by studying the
conditional probability of $w_t$, given its context, that is

$$p(w_t \mid w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_T) = \frac{p(w_1, \ldots, w_T)}{p(w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_T)}. \qquad (10.3)$$

Since, typically, the probability distribution p is not known we aim at learning it
from the available data. This idea has been taken up by Mikolov et al. [275, 276]
who designed the word to vector (word2vec) algorithm. Pennington et al. [300]
designed an alternative algorithm called global vectors (GloVe); we also refer to
Chaubard et al. [68]. We describe these algorithms in the following sections.


10.2.1 Word to Vector Algorithms 


There are two ways of estimating the probability p in (10.3). Either we can try to 
predict the center word w; from its context as in (10.3) or we can try to predict the 
context from the center word w;, which applies Bayes’s rule to (10.3). The latter 
variant is called skip-gram and the former variant is called continuous bag-of-words 
(CBOW), if we neglect the order of the words in the context. These two approaches 
have been developed by Mikolov et al. [275, 276]. 


Skip-gram Approach 


Typically, inferring a general probability distribution p over $\mathcal{T}$ is too complex.
Therefore, we make a simplifying assumption. This simplifying assumption is not
reasonable from a practical linguistic point of view, but it is sufficient to obtain a
reasonable word embedding map $e : \mathcal{W} \to \mathbb{R}^b$. We assume conditional i.i.d. of the
context words, given the center word $w_t$. Choosing a fixed context (window) size
$c \in \mathbb{N}$, we try to maximize the log-likelihood over all probabilities p satisfying this
conditional i.i.d. assumption

$$\begin{aligned}
\ell_{\boldsymbol{W}} &= \sum_{i=1}^n \log p\left(w_{i,t-c}, \ldots, w_{i,t-1}, w_{i,t+1}, \ldots, w_{i,t+c} \mid w_{i,t}\right) \\
&= \sum_{i=1}^n \ \sum_{-c \le j \le c,\, j \ne 0} \log p\left(w_{i,t+j} \mid w_{i,t}\right),
\end{aligned} \tag{10.4}$$

having n independent rows in the observed data matrix
$\boldsymbol{W} = (w_{i,t-c}, \ldots, w_{i,t+c})_{1 \le i \le n} \in \mathcal{W}^{n \times (2c+1)}$.
Thus, under the conditional i.i.d. of the context words,
given the center word, the probabilities (10.4) infer the occurrence of (individual)
context words of a given center word $w_t$ within a symmetric window of fixed size
c. In the sequel we directly work with the log-likelihood (10.4), provided that a
context word $w_{t+j}$ exists for index j; otherwise the corresponding term is just
dropped from the sum in (10.4).

The remaining step is to estimate the conditional probabilities $p(w_{t+j} \mid w_t)$ from
the data matrix $\boldsymbol{W}$. This step will provide us with the embeddings (10.2). This
estimation step is carried out by considering an approach similar to a GLM for
categorical responses, see Sect. 5.7. We make the following ansatz for the context
word $w_s$ and the center word $w_t$ (for all j)

$$p(w_s \mid w_t) = \frac{\exp \langle \widetilde{e}(w_s), e(w_t) \rangle}{\sum_{w=1}^{W} \exp \langle \widetilde{e}(w), e(w_t) \rangle}, \tag{10.5}$$

where e and $\widetilde{e}$ are two (different) embedding maps (10.2) that have the same
embedding dimension $b \in \mathbb{N}$. Thus, we construct two different embeddings e and $\widetilde{e}$
for the center words and for the context words, respectively, and these embeddings
(embedding weights) are chosen such that the log-likelihood (10.4) is maximized
for the given observations $\boldsymbol{W}$. These assumptions give us a minimization problem
for the negative log-likelihood in the embedding mappings, i.e., we minimize over
the embeddings e and $\widetilde{e}$

$$\begin{aligned}
-\ell_{\boldsymbol{W}} &= -\sum_{i=1}^n \ \sum_{-c \le j \le c,\, j \ne 0} \log \left( \frac{\exp \langle \widetilde{e}(w_{i,t+j}), e(w_{i,t}) \rangle}{\sum_{w=1}^{W} \exp \langle \widetilde{e}(w), e(w_{i,t}) \rangle} \right) \\
&= -\sum_{i=1}^n \left( \sum_{-c \le j \le c,\, j \ne 0} \langle \widetilde{e}(w_{i,t+j}), e(w_{i,t}) \rangle - 2c \log \sum_{w=1}^{W} \exp \langle \widetilde{e}(w), e(w_{i,t}) \rangle \right).
\end{aligned} \tag{10.6}$$



These optimal embeddings are learned using a variant of the gradient descent 
algorithm. This often results in a very high-dimensional optimization problem as 
we have 2bW parameters to learn, and the calculation of the last (normalization) 
term in (10.6) can be very expensive in gradient descent algorithms. For this reason 
we present the method of negative sampling below. 
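
To make the ansatz (10.5) and the objective (10.6) concrete, the following base R sketch evaluates them on a toy problem. The vocabulary size, the embedding dimension, the randomly drawn embedding weights and the four center-context pairs are purely illustrative assumptions; they are not taken from the LGPIF data.

```r
# Toy sketch of the skip-gram ansatz (10.5) and the negative log-likelihood (10.6).
# All sizes, weights and pairs below are illustrative assumptions.
set.seed(1)
W <- 5   # vocabulary size
b <- 2   # embedding dimension
E_center  <- matrix(rnorm(W * b), nrow = W)  # row w holds the center embedding e(w)
E_context <- matrix(rnorm(W * b), nrow = W)  # row w holds the context embedding e~(w)

# (10.5): probabilities of all W context words, given the center word w_t
p_context_given_center <- function(w_t) {
  scores <- as.vector(E_context %*% E_center[w_t, ])  # <e~(w), e(w_t)> for w = 1,...,W
  scores <- scores - max(scores)                      # stabilizes exp() numerically
  exp(scores) / sum(exp(scores))                      # normalization over all W words
}

# (10.6): negative log-likelihood over center-context pairs (one pair per row)
neg_loglik <- function(pairs) {
  -sum(apply(pairs, 1, function(r) log(p_context_given_center(r[1])[r[2]])))
}

pairs <- cbind(center = c(1, 1, 3, 4), context = c(2, 3, 1, 5))
probs <- p_context_given_center(1)
nll   <- neg_loglik(pairs)
```

Every evaluation of `p_context_given_center()` sums over all W words, which is precisely the normalization cost that motivates negative sampling.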


Continuous Bag-of-Words 


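For the CBOW method we start from the log-likelihood for a context size $c \in \mathbb{N}$ and
given the observations $\boldsymbol{W}$

$$\sum_{i=1}^n \log p\left(w_{i,t} \mid w_{i,t-c}, \ldots, w_{i,t-1}, w_{i,t+1}, \ldots, w_{i,t+c}\right).$$

Again we need to reduce the complexity which requires an approximation to the
above. Assume that the embedding map of the context words is given by $\widetilde{e} : \mathcal{W} \to \mathbb{R}^b$.
We then average over the embeddings of the context words in order to predict
the center word. Define the average embedding of the context words of $w_{i,t}$ (with a
fixed window size c) by

$$\bar{e}_{i,t} = \frac{1}{2c} \sum_{-c \le j \le c,\, j \ne 0} \widetilde{e}(w_{i,t+j}).$$

Making an ansatz similar to (10.5), the full log-likelihood is approximated by

$$\begin{aligned}
\sum_{i=1}^n \log p\left(w_{i,t} \mid \bar{e}_{i,t}\right) &= \sum_{i=1}^n \log \left( \frac{\exp \langle \bar{e}_{i,t}, e(w_{i,t}) \rangle}{\sum_{w=1}^{W} \exp \langle \bar{e}_{i,t}, e(w) \rangle} \right) \\
&= \sum_{i=1}^n \left( \langle \bar{e}_{i,t}, e(w_{i,t}) \rangle - \log \sum_{w=1}^{W} \exp \langle \bar{e}_{i,t}, e(w) \rangle \right).
\end{aligned} \tag{10.7}$$
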


Again the gradient descent method is applied to the negative log-likelihood to learn
the optimal embedding maps e and $\widetilde{e}$.
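
The CBOW construction can be sketched in base R for a single text: we average the context embeddings around one center position and evaluate the corresponding summand of (10.7). The token sequence, the window size and the random embedding weights are illustrative assumptions only.

```r
# Toy sketch of one CBOW log-likelihood term in (10.7); all inputs are assumptions.
set.seed(2)
W <- 6       # vocabulary size
b <- 3       # embedding dimension
c_win <- 2   # context window size c
E_center  <- matrix(rnorm(W * b), nrow = W)  # row w holds e(w)
E_context <- matrix(rnorm(W * b), nrow = W)  # row w holds e~(w)

text <- c(4, 2, 5, 1, 3)   # tokenized toy text
pos  <- 3                  # position t of the center word

# average embedding of the 2c context words around position t
ctx_idx <- setdiff((pos - c_win):(pos + c_win), pos)
e_bar   <- colMeans(E_context[text[ctx_idx], , drop = FALSE])

# softmax over all candidate center words, then the log-likelihood contribution
scores      <- as.vector(E_center %*% e_bar)
p_center    <- exp(scores - max(scores)) / sum(exp(scores - max(scores)))
loglik_term <- log(p_center[text[pos]])
```

Because the context embeddings are averaged, the order of the context words plays no role, which is why the method is called a bag-of-words approach.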


Remark 10.1 In both cases, skip-gram and CBOW, we estimate two separate
embeddings e and $\widetilde{e}$ for the center word and the context words. Typically, CBOW is
faster but skip-gram is better on words that are less frequent.


Negative Sampling 


There is a computational issue in (10.6) and (10.7) because the probability
normalizations in (10.6) and (10.7) aggregate over all available words $w \in \mathcal{W}$. This can
be computationally demanding because we need to perform this calculation in each
gradient descent step. For this reason, Mikolov et al. [276] turn the log-likelihood
optimization problem (10.6) into a binary classification problem. Consider a pair
$(w, \widetilde{w}) \in \mathcal{W} \times \mathcal{W}$ of center word w and context word $\widetilde{w}$. We introduce a binary
response variable $Y \in \{1, 0\}$ that indicates whether an observation $(W, \widetilde{W}) = (w, \widetilde{w})$
is coming from a true center-context pair (from our texts) or whether
we have a fake center-context pair (that has been generated randomly). Choosing
the canonical link of the Bernoulli EF (logistic/sigmoid function) we make the
following ansatz (in the skip-gram approach) to test for the authenticity of a
center-context pair $(w, \widetilde{w})$

$$P\left[Y = 1 \mid w, \widetilde{w}\right] = \frac{1}{1 + \exp\left(-\langle \widetilde{e}(\widetilde{w}), e(w) \rangle\right)}. \tag{10.8}$$

The recipe now is as follows: (1) Consider for a given window size c all
center-context pairs $(w_i, \widetilde{w}_i) \in \mathcal{W} \times \mathcal{W}$ of our texts, and equip them with a response $Y_i = 1$.
Assume we have N such observations. (2) Simulate N i.i.d. pairs $(w_{N+k}, \widetilde{w}_{N+k})$,
$1 \le k \le N$, by randomly choosing $w_{N+k}$ and $\widetilde{w}_{N+k}$, independent from each
other (by performing independent re-sampling with or without replacements from
the data $(w_i)_{1 \le i \le N}$ and $(\widetilde{w}_i)_{1 \le i \le N}$, respectively). Equip these (false) pairs with the
response $Y_{N+k} = 0$. (3) Maximize the following log-likelihood as a function of the
embedding maps e and $\widetilde{e}$

$$\begin{aligned}
\ell_{Y} &= \sum_{i=1}^{2N} \log P\left[Y = Y_i \mid w_i, \widetilde{w}_i\right] \\
&= \sum_{i=1}^{N} \log \left( \frac{1}{1 + \exp\left(-\langle \widetilde{e}(\widetilde{w}_i), e(w_i) \rangle\right)} \right) + \sum_{k=N+1}^{2N} \log \left( \frac{1}{1 + \exp \langle \widetilde{e}(\widetilde{w}_k), e(w_k) \rangle} \right).
\end{aligned} \tag{10.9}$$


This approach is called negative sampling because we sample false or negative
pairs $(w_{N+k}, \widetilde{w}_{N+k})$ that should not appear in our texts (as $w_{N+k}$ and $\widetilde{w}_{N+k}$ have
been generated independently from each other). The binary classification (10.9)
aims at detecting the negative pairs by letting the scalar products $\langle \widetilde{e}(\widetilde{w}_i), e(w_i) \rangle$
be large for the true pairs and letting the scalar products $\langle \widetilde{e}(\widetilde{w}_k), e(w_k) \rangle$ be small
for the false pairs. The former means that $\widetilde{e}(\widetilde{w}_i)$ and $e(w_i)$ should point into the
same direction in the embedding space $\mathbb{R}^b$. The same should apply for a synonym
of $w_i$ and, thus, we obtain the desired behavior that synonyms or words with similar
meanings tend to cluster.
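
The three steps of the recipe can be sketched in base R as follows. This is a minimal illustration, assuming a toy vocabulary, six hand-picked true pairs and randomly initialized embeddings; the false pairs are generated by randomly permuting the context component, which is one instance of re-sampling without replacement.

```r
# Toy sketch of the negative sampling log-likelihood (10.9); all inputs are assumptions.
set.seed(3)
W <- 5
b <- 2
E_center  <- matrix(rnorm(W * b, sd = 0.1), nrow = W)  # row w holds e(w)
E_context <- matrix(rnorm(W * b, sd = 0.1), nrow = W)  # row w holds e~(w)

# step (1): true center-context pairs with response Y = 1
center  <- c(1, 1, 2, 3, 4, 5)
context <- c(2, 3, 1, 4, 5, 4)
N <- length(center)

# step (2): false pairs with response Y = 0, obtained by permuting the context words
tau         <- sample(N)
center_all  <- c(center, center)
context_all <- c(context, context[tau])
Y           <- c(rep(1, N), rep(0, N))

# scalar products <e~(w~_i), e(w_i)> for all 2N pairs
s <- rowSums(E_context[context_all, , drop = FALSE] *
             E_center[center_all, , drop = FALSE])

# step (3): objective (10.9); plogis() is the logistic function, evaluated on the log scale
loglik <- sum(Y * plogis(s, log.p = TRUE) + (1 - Y) * plogis(-s, log.p = TRUE))
```

In a real fit, `loglik` would be maximized over the entries of `E_center` and `E_context` by a gradient descent method; here it is only evaluated once. Each evaluation costs 2N scalar products and no longer involves a sum over the entire vocabulary.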


Example 10.2 (word2vec with Negative Sampling) We provide an example by
constructing a word2vec embedding based on negative sampling. For this we aim
at maximizing the log-likelihood (10.9) by finding optimal embedding maps e and
$\widetilde{e} : \mathcal{W} \to \mathbb{R}^b$. To construct these embedding maps we use the Wisconsin LGPIF
data described in Sect. 13.3. The first decision (hyper-parameter) is the choice of the
embedding dimension b. The English language has millions of different words, and these
words should be (in some sense) densely embedded into a b-dimensional Euclidean
space. Typical choices of b vary between 50 and 300. Our LGPIF data vocabulary is
much smaller, and for this example we choose b = 2 because this allows us to nicely
illustrate the learned embeddings. However, apart from illustration, we should not
choose such a small dimension as it does not allow for sufficient flexibility in
discriminating the words, as we will see.

We consider all available claim texts described in Sect. 13.3. These are 6'031
texts coming from the training and validation data sets (we include the validation
data here to have more texts for learning the embeddings; this is different from
Sect. 10.1). We extract the claim descriptions from these two data sets and we apply
some pre-processing to the texts. This involves transforming all letters to lower case,
removing the special characters like !"/&, and removing the stopwords. Moreover,
we remove the words ‘damage’ and ‘damaged’ as these two words are very common
in our insurance claim descriptions, see Fig. 10.1, but they do not further specify
the claim type. Then we apply lemmatization, see Listing 10.4, and we adjust the
vocabulary with the GloVe database,³ see also Example 10.4. The latter step is
(tedious) manual work, and we do this step to be able to compare our results to
pre-trained word2vec versions.

3 https://nlp.stanford.edu/projects/glove/.

After this pre-processing we apply the tokenizer, see line 4 of Listing 10.2. This 
gives us 1'829 different words. To construct our (illustrative) embedding we only 
consider the words that appear at least 20 times over all texts, these are W = 142 
words. Thus, the following analysis is only based on the W = 142 most frequent 
words. Of course, we could increase our vocabulary by considering any text that can 
be downloaded from the internet. Since we would like to perform an insurance claim 
analysis, these texts should be related to an insurance context so that the learned 
embeddings reflect an insurance experience; we come back to this in Remark 10.4, 
below. We refrain here from doing so and embed these W = 142 words into the 
Euclidean plane (b = 2). 


Listing 10.5 Tokenization of the most frequent words

## applying the tokenizer to the cleaned texts
tokenizer <- text_tokenizer(num_words=142+1) %>% fit_text_tokenizer(dat$clean)

seqs <- texts_to_sequences(tokenizer, dat$clean)

## skip-gram of text 1 using a window of size 2
skipgrams(sequence=unlist(seqs[[1]]),
          vocabulary_size=142, window_size=2, negative_samples=0)


Listing 10.5 shows the tokenization of the most frequent words, and on line 4 we
build the (shortened) texts $w_1, w_2, \ldots$, only considering these most frequent words
$w \in \mathcal{W} = \{1, \ldots, W\}$. In total we receive 4'746 texts that contain at least two words
from $\mathcal{W}$ and, hence, can be used for the skip-gram building of center-context pairs
$(w, \widetilde{w}) \in \mathcal{W} \times \mathcal{W}$. Lines 7-8 give the code for building these pairs for a window of
size c = 2. In total we receive N = 23'952 center-context pairs $(w_i, \widetilde{w}_i)$ from our
texts. We equip these pairs with a response $Y_i = 1$. For the false pairs, we randomly
permute the second component of the true pairs, $(w_{N+i}, \widetilde{w}_{N+i}) = (w_i, \widetilde{w}_{\tau(i)})$,
where $\tau$ is a random permutation of $\{1, \ldots, N\}$. These false pairs are equipped
with a response $Y_{N+i} = 0$. Thus, altogether we have 2N = 47'904 observations
$(Y_i, w_i, \widetilde{w}_i)$, $1 \le i \le 2N$, that can be used to learn the embeddings e and $\widetilde{e}$.

Listing 10.6 R code for negative sampling

center = layer_input(shape = c(1), dtype = 'int32')
context = layer_input(shape = c(1), dtype = 'int32')
#
centerEmb = center %>%
    layer_embedding(input_dim=142,output_dim=2,input_length=1) %>% layer_flatten()
contextEmb = context %>%
    layer_embedding(input_dim=142,output_dim=2,input_length=1) %>% layer_flatten()
#
response = list(centerEmb, contextEmb) %>% layer_dot(axes = 1) %>%
    layer_dense(units=1, activation='sigmoid', name='response')  # intercept and scale of the scalar product
#
model = keras_model(inputs = c(center, context), outputs = c(response))

Listing 10.6 shows the R code to perform the embedding learning using the negative
sampling (10.9). This network has 2bW = 568 embedding weights that need to
be learned from the data. There are two more parameters involved on line 10 of
Listing 10.6. These two parameters shift the scalar products by an intercept $\beta_0$ and
scale them by a constant $\beta_1$. We could set $(\beta_0, \beta_1) = (0, 1)$; however, keeping
these two parameters trainable has led to results that are better centered around the
origin. Of course, these two parameters do not harm the arguments as they only
replace (10.8) by a slightly different model

$$P\left[Y = 1 \mid w, \widetilde{w}\right] = \frac{1}{1 + \exp\left(-\beta_0 - \beta_1 \langle \widetilde{e}(\widetilde{w}), e(w) \rangle\right)} = \frac{e^{\beta_0}}{e^{\beta_0} + e^{-\beta_1 \langle \widetilde{e}(\widetilde{w}), e(w) \rangle}},$$

and

$$P\left[Y = 0 \mid w, \widetilde{w}\right] = 1 - \frac{e^{\beta_0}}{e^{\beta_0} + e^{-\beta_1 \langle \widetilde{e}(\widetilde{w}), e(w) \rangle}} = \frac{e^{-\beta_0}}{e^{-\beta_0} + e^{\beta_1 \langle \widetilde{e}(\widetilde{w}), e(w) \rangle}}.$$



We fit this model using the nadam version of the gradient descent algorithm, and
the fitted embedding weights can be extracted with get_weights(model).
Figure 10.2 shows the learned embedding weights $e(w) \in \mathbb{R}^2$ of all words $w \in \mathcal{W}$.
We highlight the words that coincide with the insured hazards in red color, see line
10 of Listing 10.1. The word ‘vehicle’ is in the first quadrant and it is surrounded
by ‘pole’, ‘truck’, ‘garage’, ‘car’, ‘traffic’. The word ‘vandalism’ is in the third
quadrant surrounded by ‘graffito’, ‘window’, ‘pavilion’, names of cities and parks,
‘ms’ for middle school. Finally, the words ‘fire’, ‘wind’, ‘lightning’ and ‘hail’ are
in the first and fourth quadrant, close to ‘water’; these words are surrounded by
‘bldg’ (building), ‘smoke’, ‘equipment’, ‘alarm’, ‘safety’, ‘power’, ‘library’, etc. We
conclude that these embeddings make perfect sense in an insurance claim context.
Note that we have applied some pre-processing, and the embeddings could even be
improved by further pre-processing, e.g., both ‘vandalism’ and ‘vandalize’ or both ‘hs’
and ‘high school’ are used in the texts.

Another nice observation is that the embeddings tend to build a circle around the
origin, see Fig. 10.2. This is enforced by embedding W = 142 different words into
a b = 2 dimensional space so that dissimilar words optimally repulse each other. ■
Fig. 10.2 Two-dimensional skip-gram embedding using negative sampling (panel title:
2-dimensional embedding of center word); in red color are the
insured hazards ‘vehicle’, ‘fire’, ‘lightning’, ‘wind’, ‘hail’, ‘water’ and ‘vandalism’


10.2.2 Global Vectors Algorithm 


A second popular word embedding approach is global vectors (GloVe) developed
by Pennington et al. [300]; we also refer to Chaubard et al. [68]. GloVe is an
unsupervised learning method that performs a word-word clustering (center-context
pairs) over all available texts. Assume that the tokenization of all texts provides us
with the words $w \in \mathcal{W}$. Choose a fixed context window size $c \in \mathbb{N}$ and define the
matrix

$$C = \left( C(w, \widetilde{w}) \right)_{w, \widetilde{w} \in \mathcal{W}} \in \mathbb{N}_0^{W \times W},$$

with $C(w, \widetilde{w})$ counting the number of co-occurrences of w and $\widetilde{w}$ over all available
texts where the word $\widetilde{w}$ appears as a context word of the center word w (for the
given window size c). We note that C is a symmetric matrix that is typically sparse
as many words do not appear in the context of other words (on finitely many
texts). Figure 10.3 shows the center-context pairs $(w, \widetilde{w})$ co-occurrence matrix C
of Example 10.2 which is based on W = 142 words and 23'952 center-context
pairs. The color pixels indicate the pairs that occur in the data, $C(w, \widetilde{w}) > 0$, and
the white space corresponds to the pairs that have not been observed in the texts,
$C(w, \widetilde{w}) = 0$. This plot confirms the sparsity of the center-context pairs; the words
are ordered w.r.t. their frequencies in the texts.

Fig. 10.3 Center-context co-occurrence matrix
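
As a small illustration of how such a matrix arises, the following base R sketch counts the co-occurrences for three toy texts with window size c = 1. The texts and the vocabulary size are invented for this purpose; the variable names are our own.

```r
# Toy sketch of the co-occurrence matrix C; the tokenized texts are assumptions.
texts <- list(c(1, 2, 3), c(2, 3, 4, 1), c(3, 1))
W     <- 4   # vocabulary size
c_win <- 1   # context window size c

C <- matrix(0L, nrow = W, ncol = W)   # rows: center words, columns: context words
for (txt in texts) {
  for (pos in seq_along(txt)) {
    for (j in setdiff(-c_win:c_win, 0)) {
      k <- pos + j
      # only count context positions that exist within the text
      if (k >= 1 && k <= length(txt)) {
        C[txt[pos], txt[k]] <- C[txt[pos], txt[k]] + 1L
      }
    }
  }
}

n_pairs  <- sum(C)        # total number of center-context pairs
sparsity <- mean(C == 0)  # share of unobserved pairs
```

The symmetry of C follows from the symmetric window: whenever $\widetilde{w}$ is a context word of w at some position, w is also a context word of $\widetilde{w}$.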

In an empirical analysis Pennington et al. [300] have observed that the crucial
quantities to be considered are the ratios for fixed context words. That is, for a
context word $\widetilde{w}$ study a function of the center words w and v (subject to existence
of the right-hand side)

$$(w, v, \widetilde{w}) \mapsto \frac{C(w, \widetilde{w}) / \sum_{u \in \mathcal{W}} C(w, u)}{C(v, \widetilde{w}) / \sum_{u \in \mathcal{W}} C(v, u)} = \frac{\widehat{p}(\widetilde{w} \mid w)}{\widehat{p}(\widetilde{w} \mid v)},$$

with $\widehat{p}$ denoting the empirical probabilities. An empirical analysis suggests that such an
approach seems to lead to a good discrimination of the meanings of the words, see
Sect. 3 in Pennington et al. [300]. Further simplifications and assumptions provide
the following ansatz, for details we refer to Pennington et al. [300],

$$\log C(w, \widetilde{w}) \approx \langle \widetilde{e}(\widetilde{w}), e(w) \rangle + \widetilde{\beta}_{\widetilde{w}} + \beta_w,$$

with intercepts $\widetilde{\beta}_{\widetilde{w}}, \beta_w \in \mathbb{R}$. There is still one issue, namely, that $\log C(w, \widetilde{w})$ may
not be well-defined as certain pairs $(w, \widetilde{w})$ are not observed. Therefore, Pennington
et al. [300] propose to solve a weighted squared error loss function problem to find
the embedding mappings e, $\widetilde{e}$ and intercepts $\widetilde{\beta}_{\widetilde{w}}, \beta_w \in \mathbb{R}$. Their objective function
is given by

$$\sum_{w, \widetilde{w} \in \mathcal{W}} \chi\left(C(w, \widetilde{w})\right) \left( \log C(w, \widetilde{w}) - \langle \widetilde{e}(\widetilde{w}), e(w) \rangle - \widetilde{\beta}_{\widetilde{w}} - \beta_w \right)^2, \tag{10.10}$$

with weighting function

$$x \ge 0 \ \mapsto \ \chi(x) = \left( \frac{x \wedge x_{\max}}{x_{\max}} \right)^{\alpha},$$

for $x_{\max} > 0$ and $\alpha > 0$. Pennington et al. [300] state that the model depends
weakly on the cutoff point $x_{\max}$, they propose $x_{\max} = 100$, and a sub-linear
behavior seems to outperform a linear one, suggesting, e.g., a choice of $\alpha = 3/4$.
Under these choices the embeddings e and $\widetilde{e}$ are found by minimizing the objective
function (10.10) for the given data. Note that $\lim_{x \downarrow 0} \chi(x) (\log x)^2 = 0$.
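
The weighting function and a single summand of (10.10) can be written down in a few lines of base R. The function names `glove_weight()` and `glove_term()` are our own (hypothetical) choices, and the defaults are the values proposed by Pennington et al. [300]; unobserved pairs are assigned a zero contribution, in line with the limit noted above.

```r
# Sketch of the GloVe weighting function and one summand of the objective (10.10).
# Function names are illustrative; defaults x_max = 100 and alpha = 3/4 follow the text.
glove_weight <- function(x, x_max = 100, alpha = 3 / 4) {
  (pmin(x, x_max) / x_max)^alpha
}

# 'count' is C(w, w~), 'fitted' is <e~(w~), e(w)> plus the two intercepts
glove_term <- function(count, fitted) {
  # pmax() only protects log(0); pairs with count = 0 contribute exactly 0
  ifelse(count > 0, glove_weight(count) * (log(pmax(count, 1)) - fitted)^2, 0)
}

w_once <- glove_weight(1)     # weight of a pair observed exactly once
w_cap  <- glove_weight(500)   # counts above x_max receive weight 1
```

A pair that is observed exactly once has target log 1 = 0 and a small weight, so such pairs pull the fitted value towards zero.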


Example 10.3 (GloVe Word Embedding) We provide an example using the GloVe
embedding model, and we revisit the data of Example 10.2; we also use exactly the
same pre-processing as in that example. We start from N = 23'952 center-context
pairs.

In a first step we count the number of co-occurrences $C(w, \widetilde{w})$. There are only
4'972 pairs that occur, $C(w, \widetilde{w}) > 0$; this corresponds to the colors in Fig. 10.3.
With these 4'972 pairs we have to fit 568 embedding weights (for the embedding
dimension b = 2) and 284 intercepts $\widetilde{\beta}_{\widetilde{w}}, \beta_w$, thus, 852 parameters in total. The
results of this fitting are shown in Fig. 10.4.

The general picture in Fig. 10.4 is similar to Fig. 10.2, e.g., ‘vandalism’ is
surrounded by ‘graffito’, ‘window’, ‘pavilion’, names of cities and parks, ‘ms’
and ‘es’; or ‘vehicle’ is surrounded by ‘pole’, ‘traffic’, ‘street’, ‘signal’. However,
the clustering of the words around the origin shows a crucial difference between
GloVe and the negative sampling of word2vec. The problem here is that we do
not have sufficiently many observations. We have 4'972 center-context pairs that
occur, $C(w, \widetilde{w}) > 0$. Of these pairs, 2'396 occur exactly once, $C(w, \widetilde{w}) = 1$; this is
almost half of the observations with $C(w, \widetilde{w}) > 0$. GloVe (10.10) considers these
observations on the log-scale which provides $\log C(w, \widetilde{w}) = 0$ for the pairs that
occur exactly once. The weighted square loss for these pairs is minimized by either
setting $\widetilde{e}(\widetilde{w}) = 0$ or $e(w) = 0$, provided that the intercepts are also set to 0. This
is exactly what we observe in Fig. 10.4 and, thus, successfully fitting GloVe would
require many more (frequent) observations. ■


Remark 10.4 (Pre-trained Word Embeddings) In practical applications we rely on
pre-trained word embeddings. For GloVe there are pre-trained versions that can be
downloaded.⁴ These pre-trained versions comprise a vocabulary of 400K words,
and they exist for the embedding dimensions b = 50, 100, 200, 300. These GloVe
embeddings have been trained on Wikipedia 2014 and Gigaword 5 which provided roughly 6B
tokens. Another pre-trained open-source model that can be downloaded is spaCy.⁵

4 https://nlp.stanford.edu/projects/glove/.
5 https://spacy.io/models/en#en_core_web_md.

Fig. 10.4 Two-dimensional GloVe embedding; in red color are the insured hazards ‘vehicle’,
‘fire’, ‘lightning’, ‘wind’, ‘hail’, ‘water’ and ‘vandalism’


Pre-trained embeddings can be problematic if we work in very specific settings. 
For instance, the Wisconsin LGPIF data contains the word ‘Lincoln’ in the claim 
descriptions. Now, Lincoln is a county in Wisconsin, it is a town in Kewaunee County
in Wisconsin, it is a former US president, there are Lincoln memorials, it is a
common street name, it is a car brand and there are restaurants with this name. 
In our context, Lincoln is most commonly used w.r.t. the Lincoln Elementary and 
Middle Schools. On the other hand, it is likely that in pre-trained embeddings a 
different meaning of Lincoln is predominant, and therefore the embedding may not 
be reasonable for our insurance problem. 


10.3 Lab: Predictive Modeling Using Word Embeddings 


This section gives an example of applying the word embedding technique to a
predictive modeling setting. This example is based on the Wisconsin LGPIF data
set illustrated in Listing 10.1. Our goal is to predict the hazard types on line 10
of Listing 10.1 from the claim descriptions on line 11. We perform the same data
cleaning process as in Example 10.2. This provides us with W = 1'829 different
words, and the resulting (short) claim descriptions have a maximal length of T = 9.
After padding with zeros we receive n = 6'031 claim descriptions given by texts
$(w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$; we apply the padding to the left end of the sentences.


Word2vec Using Negative Sampling We start with the word2vec embedding
technique using the negative sampling. We follow Example 10.2, and to successfully
embed the available words $w \in \mathcal{W}$ we restrict the vocabulary to the words that are
used at least 20 times. This reduces the vocabulary from 1'829 different words to
142 different words. The number of claim descriptions is reduced to 5'883 because
148 claim descriptions do not contain any of these 142 different words and, thus,
cannot be classified as one of the hazard types (based on this reduced vocabulary).

In a first analysis we choose the embedding dimension b = 2, and this provides
us with the word2vec embedding map that is illustrated in Fig. 10.2. Based on these
embeddings we aim at predicting the hazard types from the claim descriptions. We
have 9 different hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW,
Vehicle, Vandalism and Misc.⁶ Therefore, we design a categorical classification
model that has 9 different labels, we refer to Sect. 2.1.4.


Listing 10.7 R code for the hazard type prediction based on a word2vec embedding

input = layer_input(shape = list(T), name = "input")
#
word2vec = input %>%
    layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
                    weights=list(wordEmb), trainable=FALSE) %>%
    layer_flatten()
#
response = word2vec %>%
    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
    layer_dense(units=9, activation='softmax', name='output')
#
model = keras_model(inputs = c(input), outputs = c(response))


The R code for the hazard type prediction is presented in Listing 10.7. The crucial
part is shown on line 5. Namely, the embedding map $e(w) \in \mathbb{R}^b$, $w \in \mathcal{W}$, is
initialized with the embedding weights wordEmb received from Example 10.2, and
these embedding weights are declared to be non-trainable.⁷ These features are then
inputted into a FN network with two FN layers having $(q_1, q_2) = (20, 15)$ neurons,
and as output activation we choose the softmax function. This model has 286
non-trainable embedding weights, and r = (9 · 2 + 1)20 + (20 + 1)15 + (15 + 1)9 = 839
trainable parameters.

6 WaterW relates to weather related water claims, and WaterNW relates to non-weather related
water claims.

Fig. 10.5 Confusion matrices of the hazard type prediction using a word2vec embedding based on
negative sampling (lhs) b = 2 dimensional embedding and (rhs) b = 10 dimensional embedding;
columns show the observations and rows show the predictions

We fit this network using the nadam version of the gradient descent method, and
we exercise an early stopping on a 20% validation data set (of the entire data). This
network is fitted in a few seconds, and the results are presented in Fig. 10.5 (lhs).
This figure shows the confusion matrix of prediction vs. observed (row vs. column).
The general results look rather good; there are only difficulties in distinguishing WaterW
from WaterNW claims.

In a second analysis, we increase the embedding dimension to b = 10 and
we perform exactly the same procedure as above. A higher embedding dimension
allows the embedding map to better discriminate the words in their meanings.
However, we should not go for a too high b because we have only 142 different
words and 47'904 center-context pairs $(w, \widetilde{w})$ to learn these embeddings $e(w) \in \mathbb{R}^b$.
A higher embedding dimension also increases the number of network weights in
the first FN layer on line 9 of Listing 10.7. This time, we need to train r =
(9 · 10 + 1)20 + (20 + 1)15 + (15 + 1)9 = 2'279 parameters. The results are
presented in Fig. 10.5 (rhs). We observe an overall improvement compared to the
2-dimensional embeddings. This is also confirmed by Table 10.1 which gives the
deviance losses and the misclassification rates.
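
The parameter counts quoted in this section follow from simple arithmetic, which the base R sketch below reproduces. The helper names `emb_params()` and `fn_params()` are our own (hypothetical) choices; the layer sizes are those of Listing 10.7, and the embedding layer has W + 1 rows because the padding index 0 needs its own row.

```r
# Sketch of the parameter counts for the architecture of Listing 10.7.
# Helper names are illustrative and not part of the keras API.
emb_params <- function(W, b) (W + 1) * b   # W words plus the padding index 0

fn_params <- function(T_len, b, q = c(20, 15), n_out = 9) {
  q0 <- T_len * b   # dimension of the flattened embedding output
  (q0 + 1) * q[1] + (q[1] + 1) * q[2] + (q[2] + 1) * n_out
}

emb_b2  <- emb_params(W = 142, b = 2)    # non-trainable embedding weights, b = 2
emb_b10 <- emb_params(W = 142, b = 10)   # non-trainable embedding weights, b = 10
r_b2    <- fn_params(T_len = 9, b = 2)   # trainable network parameters, b = 2
r_b10   <- fn_params(T_len = 9, b = 10)  # trainable network parameters, b = 10
```

The first FN layer dominates the count because its input dimension Tb grows linearly in the embedding dimension b.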


7 The zeros from padding are mapped to the origin. 
Table 10.1 Hazard prediction results summarized in deviance losses and misclassification rates

Model                                | Embedding parameters | Network parameters | Deviance loss | Misclassification rate
word2vec negative sampling, b = 2    | 286                  | 839                | 0.1442        | 19.9%
word2vec negative sampling, b = 10   | 1'430                | 2'279              | 0.0912        | 13.7%
FN GloVe using all words, b = 50     | 91'500               | 9'479              | 0.0802        | 11.7%
LSTM GloVe using all words, b = 50   | 91'500               | 3'369              | 0.0802        | 12.1%
Word similarity embedding, b = 7     | 12'810               | 1'739              | 0.1396        | 21.1%


Pre-trained GloVe Embedding In a next analysis we use the pre-trained GloVe
embeddings, see Remark 10.4. This allows us to use all W = 1'829 words that
appear in the n = 6'031 claim descriptions, and we can also classify all these
claims. I.e., we can classify more claims, here, compared to the 5'883 claims we
have classified based on the self-trained word2vec embeddings. Apart from that, all
modeling steps are chosen as above. Only the higher embedding dimension b = 50
from the pre-trained glove.6B.50d increases the size of the network parameter
to r = (9 · 50 + 1)20 + (20 + 1)15 + (15 + 1)9 = 9'479 parameters; note that
the 91'500 embedding weights are not trained as they come from the pre-trained
GloVe embeddings. Using the nadam optimizer with an early stopping provides us
with the results in Fig. 10.6 (lhs). Using this pre-trained GloVe embedding leads to a
further improvement; this is also verified by Table 10.1. The effect of using the
pre-trained GloVe is two-fold. On the one hand, it allows us to use all words of the claim descriptions,
which improves the prediction accuracy. On the other hand, the embeddings are
not adapted to insurance problems, as these have been trained on Wikipedia and
Gigaword texts. The former advantage outweighs the latter shortcoming in our
example.


All the results above have been using the FN network of Listing 10.7. We made
this choice because our texts have a maximal length of T = 9, which is very
short. In general, texts should be understood as time-series, and RN networks are
a canonical choice to analyze these time-series. Therefore, we study again the
pre-trained GloVe embeddings, but we process the texts with a LSTM architecture, we
refer to Sect. 8.3.1 for LSTM layers.

Listing 10.8 shows the LSTM architecture used. On line 9 we set the variable
return_sequences to true which implies that all intermediate steps $z_t^{(1)}$, $1 \le t \le T$,
are outputted to a time-distributed FN layer on line 10, see Sect. 8.2.4 for
time-distributed layers. This LSTM network has r = 4(50 + 1 + 10)10 + (10 +
1)10 + (90 + 1)9 = 3'369 parameters. The flatten layer on line 11 of Listing 10.8
turns the T = 9 outputs $z_t^{(2)} \in \mathbb{R}^{q_2}$, $1 \le t \le T$, of dimension $q_2 = 10$ into a vector
of size $T q_2 = 90$. This vector is then fed into the output layer on line 12. At this
stage, one could reduce the dimension of the parameter by setting a max-pooling
layer in between the flatten and the output layer.


Fig. 10.6 Confusion matrices of the hazard type prediction using the pre-trained GloVe with b =
50 (lhs) FN network and (rhs) LSTM network; columns show the observations and rows show the
predictions


Listing 10.8 R code for the hazard type prediction using a LSTM architecture

input = layer_input(shape = list(T), name = "input")
#
word2vec = input %>%
    layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
                    weights=list(wordEmb), trainable=FALSE)  # no flatten here: the LSTM needs the (T, b) sequence
#
response = word2vec %>%
    layer_lstm(units=10, activation='tanh',
               return_sequences=TRUE, name='LSTM') %>%
    time_distributed(layer_dense(units=10, activation='tanh', name='FNLayer')) %>%
    layer_flatten() %>%
    layer_dense(units=9, activation='softmax', name='output')
#
model = keras_model(inputs = c(input), outputs = c(response))


We fit this LSTM architecture to the data using the pre-trained GloVe embeddings.
The results are presented in Fig. 10.6 (rhs) and Table 10.1. We receive the
same deviance loss, and the misclassification rate is slightly worse than in the
FN network case (with the same pre-trained GloVe embeddings). Note that the
deviance loss is calculated on the estimated classification probabilities
$\widehat{p}(x) = (\widehat{p}_1(x), \ldots, \widehat{p}_9(x))^\top$, and the labels are obtained by

$$\widehat{Y} = \widehat{Y}(x) = \underset{k = 1, \ldots, 9}{\arg\max}\ \widehat{p}_k(x).$$

Thus, it may happen that the improvements on the estimated probabilities are not
fully reflected on the predicted labels.
Word (Cosine) Similarity In our final analysis we work with the pre-trained GloVe
embeddings $e(w) \in \mathbb{R}^{50}$ but we first try to reduce the embedding dimension b. For
this we follow Lee et al. [236], and we consider a word similarity. We can define
the similarity of the words w and $w' \in \mathcal{W}$ by considering the scalar product of their
embeddings

$$\mathrm{sim}^{(u)}(w, w') = \langle e(w), e(w') \rangle \qquad \text{or} \qquad \mathrm{sim}^{(n)}(w, w') = \frac{\langle e(w), e(w') \rangle}{\| e(w) \|_2 \, \| e(w') \|_2}. \tag{10.11}$$

The first one is an unweighted version and the second one is a normalized
version scaling with the corresponding Euclidean norms so that
the similarity measure is within [-1, 1]. In fact, the latter is also called
cosine similarity. To reduce the embedding dimension and because we
have a classification problem with hazard names, we can evaluate the
(cosine) similarity of all used words $w \in \mathcal{W}$ to the hazards
$h \in \mathcal{H}$ = {fire, lightning, hail, wind, water, vehicle, vandalism}. Observe
that water is further separated into weather related and non-weather related claims,
and there is a further hazard type called misc, which collects all the rest. We could
choose more words in $\mathcal{H}$ to more precisely describe these water and other claims. If
we just use $\mathcal{H}$ we obtain a $b = |\mathcal{H}| = 7$ dimensional embedding mapping

$$w \in \mathcal{W} \ \mapsto \ e^{(a)}(w) = \left( \mathrm{sim}^{(a)}(w, \text{fire}), \ldots, \mathrm{sim}^{(a)}(w, \text{vandalism}) \right)^\top \in \mathbb{R}^7, \tag{10.12}$$

for $a \in \{u, n\}$. This gives us for every text $= (w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$ the
pre-processed features

$$\text{text} \ \mapsto \ \left( e^{(a)}(w_1), \ldots, e^{(a)}(w_T) \right) \in \mathbb{R}^{7 \times T}. \tag{10.13}$$



Lee et al. [236] apply a max-pooling layer to these embeddings which are then
inputted into a GAM classification model. We use a different approach here, and
directly use the unweighted (a = u) text representations (10.13) as an input to a
network, either of the FN network type of Listing 10.7 or of the LSTM type of Listing 10.8.
If we use the FN network type we receive the results on the last line of Table 10.1
and Fig. 10.7.
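
The similarity embedding (10.12) is easy to sketch in base R. The five words, the three hazards and the random 5-dimensional vectors below are illustrative assumptions standing in for the pre-trained GloVe vectors; only the two similarity measures of (10.11) are taken from the text.

```r
# Toy sketch of the word similarity embedding (10.11)-(10.12).
# The words and the random embedding vectors are assumptions, not GloVe weights.
set.seed(4)
words   <- c("fire", "smoke", "hail", "truck", "vehicle")
hazards <- c("fire", "hail", "vehicle")
E <- matrix(rnorm(length(words) * 5), nrow = length(words),
            dimnames = list(words, NULL))   # row w holds e(w)

sim_u <- function(x, y) sum(x * y)                                      # unweighted
sim_n <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # cosine

# (10.12): similarities of the word w to all hazards, here of dimension 3
embed_similarity <- function(w, sim) {
  sapply(hazards, function(h) sim(E[w, ], E[h, ]))
}

e_u <- embed_similarity("smoke", sim_u)
e_n <- embed_similarity("smoke", sim_n)
```

Applying `embed_similarity()` to every word of a padded text and binding the results column-wise gives the feature matrix (10.13).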


Comparing the results of the word similarity through the embeddings (10.12) 
and (10.13) to the other prediction results, we conclude that this word similarity 
approach is not fully competitive compared to working directly with the word2vec 
or GloVe embeddings. It seems that the projection (10.12) does not discriminate 
sufficiently for our classification task. 
10.4 Lab: Deep Word Representation Learning


All examples above have been relying on embedding the words $w \in \mathcal{W}$ into
a Euclidean space $e(w) \in \mathbb{R}^b$ by performing a sort of unsupervised learning
that provided word similarity clusters. The advantage of this approach is that
the embedding is decoupled from the regression or classification task, which is
computationally attractive. Moreover, once a suitable embedding has been learned,
it can be used for several different tasks (in the spirit of transfer learning). The
disadvantage of the pre-trained embeddings is that the embedding is not targeted to
the regression task at hand. This has already been discussed in Remark 10.4 where
we have highlighted that the meaning of some words (such as Lincoln) depends very
much on their context.

Recent NLP aims at pre-processing a text as little as necessary, but tries 
to directly feed the raw sentences into RN networks such as LSTM or GRU 
architectures. Computationally this is much more demanding because we have 
to learn the embeddings and the network weights simultaneously, we refer to 
Table 10.1 to indicate the number of parameters involved. The purpose of this short 
section is to give an example, though our NLP database is rather small; this latter 
approach usually requires a huge database and the corresponding computational 
power. Ferrario-Nägelin [126] provide a more comprehensive example on the
classification of movie reviews. For their analysis they evaluated approximately
50’000 movie reviews each using between 235 and 2’498 words. Their analysis
was implemented on the ETH High Performance Computing (HPC) infrastructure
Euler (https://scicomp.ethz.ch/wiki/Euler), and their run times have been between
20 and 30 minutes, see Table 8 of Ferrario-Nägelin [126].

Since we have neither the computational power nor the big data to fit such
an NLP application, we start the gradient descent fitting in the initial embedding
weights $e(w) \in \mathbb{R}^b$ that either come from the word2vec or the GloVe embeddings.
During the gradient descent fitting, we allow these weights to change w.r.t. the
regression task at hand. In comparison to Sect. 10.3, this only requires minor
changes to the R code, namely, the only modification needed is to change from
FALSE to TRUE on line 5 of Listings 10.7 and 10.8. This change allows us to
learn adapted weights during the gradient descent fitting. The resulting classification
models are now very high-dimensional, and we need to carefully assess the
early stopping rule, otherwise the model will (in-sample) over-fit to the learning
data.
In Fig. 10.8 we provide the results that correspond to the self-trained word2vec 
embeddings given in Fig. 10.5, and the corresponding numerical results are given 
in Table 10.2. We observe an improvement in the prediction accuracy in both cases
by letting the embedding weights be learned during the network fitting, and we
receive a misclassification rate of 11.6% and 11.0% for the embedding dimensions
$b = 2$ and $b = 10$, respectively, see Table 10.2.
Figure 10.8 (rhs) illustrates how the embeddings have changed from the initial
(pre-trained) embeddings $e^{(0)}(w)$ (coming from the word2vec negative sampling) to the
learned embeddings $\widehat{e}(w)$. We measure these changes in terms of the unweighted
similarity measure defined in (10.11), and given by

$$\left\langle e^{(0)}(w),\, \widehat{e}(w) \right\rangle. \qquad (10.14)$$


The upper horizontal line is a manually set threshold to identify the words w that 
experience a major change in their embeddings. These are the words ‘vandalism’, 
‘lightning’, ‘grafito’, ‘fence’, ‘hail’, ‘freeze’, ‘blow’ and ‘breakage’. Thus, these 
words receive a different embedding location/meaning which is more favorable for 
our classification task. 

A similar analysis can be performed for the pre-trained GloVe embeddings. There
we expect bigger changes to the embeddings since the GloVe embeddings have
not been learned in an insurance context, and the embeddings will be adapted to
the insurance prediction problem. We refrain from giving an explicit analysis here,
because to perform a thorough analysis we would need (much) more data.

We conclude this example with some remarks. We emphasize once more that
our available data is minimal, and we expect (even much) better results for longer
claim descriptions. In particular, our data is not sufficient to discriminate the weather
related from the non-weather related water claims, as the claim descriptions seem
to focus on the water claim itself and not on its cause. In a next step, one should use
claim descriptions in order to predict the claim sizes, or to improve their predictions
if they are based on classical tabular features, only. Here, we see some potential, in
particular, w.r.t. medical claims, as medical reports may clearly indicate the severity
of the claim and may also give some insight into the recovery process.
Thus, our small example may only give some intuition of what is possible with
(unstructured) text data. Unfortunately, the LGPIF data of Listing 10.1 did not give
us any satisfactory results for the claim size prediction, for several reasons.
Firstly, the data is rather heterogeneous ranging from small to very large claims
and any member of the EDF struggles to model this data; we come back to a
different modeling proposal of heterogeneous data in Sect. 11.3.2. Secondly, the
claim descriptions are not very explanatory as they are too short to provide more
detailed information. Thirdly, the data has only 5’424 claims which seems small
compared to the complexity of the problem that we try to solve.

Fig. 10.8 Confusion matrices and the changes in the embeddings compared to the pre-trained
word2vec embeddings of Fig. 10.5 for the dimensions b = 2 and b = 10

Table 10.2 Hazard prediction results summarized in deviance losses and misclassification rates:
pre-trained embeddings vs. network learned embeddings

                                      | Number of  | Deviance | Misclass.
                                      | parameters | loss     | rate
word2vec negative sampling, b = 2     | 286        | 0.1442   | 19.9%
word2vec improved embedding, b = 2    |            | 0.0814   | 11.7%
word2vec negative sampling, b = 10    |            | 0.0912   | 13.7%
word2vec improved embedding, b = 10   | 3’709      | 0.0714   | 10.5%


10.5 Outlook: Creating Attention 


In text recognition problems, obviously, not all the words in a sentence have the 
same importance. In the examples above, we have removed the stopwords as they 
may disturb the key understanding of our texts. Removing the stopwords means that 
we pay more attention to the remaining words. RN networks often face difficulty 
in giving the right recognition to the different parts of a sentence. For this reason, 
attention layers have gained more popularity recently. Attention layers are special 
modules in network architectures that allow the network to impose more weight 
on certain parts of the information in the features to emphasize their importance. 
The attention mechanism has been introduced in Bahdanau et al. [21]. There are 
different ways of modeling attention, the most popular one is the so-called dot- 
product attention, we refer to Vaswani et al. [366], and in the actuarial literature we 
mention Kuo—Richman [231] and Troxler—Schelldorfer [354]. 

We start by describing a simple attention mechanism. Consider a sentence
text $= (w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$ that provides, under an embedding map
$e : \mathcal{W}_0 \to \mathbb{R}^b$, the embedded sentence $(e(w_1), \ldots, e(w_T))^\top \in \mathbb{R}^{T \times b}$.
We choose a weight matrix $U_Q \in \mathbb{R}^{b \times b}$ and an intercept vector $u_Q \in \mathbb{R}^b$.
Based on these choices we consider for each word $w_t$ of our sentence the score, called query,

$$q_t = \tanh\left(u_Q + U_Q\, e(w_t)\right) \in (-1, 1)^b. \qquad (10.15)$$

Matrix $Q = (q_1, \ldots, q_T)^\top \in \mathbb{R}^{T \times b}$ collects all queries. It is obtained by
applying a time-distributed FN layer with $b$ neurons to the embedded sentence
$(e(w_1), \ldots, e(w_T))^\top$.

These queries $q_t$ are evaluated with a so-called key $k \in \mathbb{R}^b$ giving us the attention
weights

$$\alpha_t = \frac{\exp\langle k, q_t \rangle}{\sum_{s=1}^T \exp\langle k, q_s \rangle} \in (0, 1) \qquad \text{for } 1 \le t \le T. \qquad (10.16)$$

Using these attention weights $\alpha = (\alpha_1, \ldots, \alpha_T)^\top \in (0, 1)^T$ we encode the sentence
text as

$$\text{text} = (w_1, \ldots, w_T) ~\mapsto~ w^* = \sum_{t=1}^T \alpha_t\, e(w_t) = \left(e(w_1), \ldots, e(w_T)\right) \alpha \in \mathbb{R}^b. \qquad (10.17)$$

Thus, to every sentence text we assign a categorical probability vector $\alpha =
\alpha(\text{text}) \in \Delta_T$, see Sect. 2.1.4, (6.22) and (5.69), which is encoding this sentence
text to a $b$-dimensional vector $w^* \in \mathbb{R}^b$. This vector is then further processed
by the network. Such a construction is called a self-attention mechanism because
the text $(w_1, \ldots, w_T) \in \mathcal{W}_0^T$ is used to formulate the queries in (10.15), but, of
course, these queries could also be coming from a completely different source.
In the above set-up we have to learn the following parameters $U_Q \in \mathbb{R}^{b \times b}$ and
$u_Q, k \in \mathbb{R}^b$, assuming that the embedding map $e : \mathcal{W}_0 \to \mathbb{R}^b$ has already been
specified.
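
The mechanism (10.15)-(10.17) can be traced numerically in a few lines of base R. The following is only a minimal sketch under stated assumptions: the embedded sentence, the weight matrix, the intercept and the key are random placeholders (in a fitted model they would come from the embedding map and from gradient descent), and the names `E`, `U_Q`, `u_Q`, `k` and `n_words` are chosen here for illustration.

```r
# Minimal base-R sketch of the attention mechanism (10.15)-(10.17).
# All numbers are random placeholders, not fitted values.
set.seed(100)
n_words <- 5   # sentence length T (named n_words to avoid masking R's T)
b <- 3         # embedding dimension

# embedded sentence: row t holds e(w_t), i.e. a (T x b) matrix
E <- matrix(rnorm(n_words * b), nrow = n_words, ncol = b)

# parameters that would be learned with gradient descent
U_Q <- matrix(rnorm(b * b), nrow = b, ncol = b)  # weight matrix
u_Q <- rnorm(b)                                  # intercept vector
k   <- rnorm(b)                                  # key

# queries (10.15): q_t = tanh(u_Q + U_Q e(w_t)), stored as the rows of Q
Q <- tanh(sweep(E %*% t(U_Q), 2, u_Q, "+"))

# attention weights (10.16): softmax of the scores <k, q_t>
scores <- as.vector(Q %*% k)
alpha  <- exp(scores - max(scores))  # subtracting the max is only for numerical stability
alpha  <- alpha / sum(alpha)

# encoding (10.17): w* = sum_t alpha_t e(w_t)
w_star <- as.vector(t(E) %*% alpha)

print(round(alpha, 4))
print(round(w_star, 4))
```

The vector `alpha` lies in the simplex $\Delta_T$, and `w_star` is the corresponding convex combination of the word embeddings.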

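There are several generalizations and modifications to this self-attention mechanism.
The most common one is to expand the vector $w^* \in \mathbb{R}^b$ in (10.17) to a
matrix $W^* = (w_1^*, \ldots, w_q^*) \in \mathbb{R}^{b \times q}$. This matrix $W^*$ can be interpreted as having
$q$ neurons $w_j^* \in \mathbb{R}^b$, $1 \le j \le q$. For this, one replaces the key $k \in \mathbb{R}^b$ by a
matrix-valued key $K = (k_1, \ldots, k_q) \in \mathbb{R}^{b \times q}$. This allows one to calculate the attention
weight matrix

$$A = \left(\alpha_{t,j}\right)_{1 \le t \le T,\, 1 \le j \le q} = \left(\frac{\exp\langle k_j, q_t \rangle}{\sum_{s=1}^T \exp\langle k_j, q_s \rangle}\right)_{1 \le t \le T,\, 1 \le j \le q} = \mathrm{softmax}(QK) \in (0, 1)^{T \times q},$$

where the softmax function is applied column-wise. I.e., the attention weight matrix
$A \in (0, 1)^{T \times q}$ has columns $\alpha_j = (\alpha_{1,j}, \ldots, \alpha_{T,j})^\top \in \Delta_T$, $1 \le j \le q$, which are
normalized to total weight 1; this is equivalent to (10.16). This is used to encode the
sentence text

$$\left(e(w_1), \ldots, e(w_T)\right) \in \mathbb{R}^{b \times T} ~\mapsto~ W^* = \left(e(w_1), \ldots, e(w_T)\right) A = \left(\sum_{t=1}^T \alpha_{t,j}\, e(w_t)\right)_{1 \le j \le q} \in \mathbb{R}^{b \times q}. \qquad (10.18)$$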


Mapping (10.18) is called an attention layer. Let us give some remarks. 
Remarks 10.5 


• Encoding (10.18) gives a natural multi-dimensional extension of (10.17). The
  crucial parts are the attention weights $\alpha_j \in \Delta_T$ which weigh the different words
  $(w_t)_{1 \le t \le T}$. In the multi-dimensional case, we perform this weighting mechanism
  multiple times (in different directions), allowing us to extract different features
  from the sentences. In contrast, in (10.17) we only do this once. This is similar
  to going from one neuron to a layer of $q$ neurons.

• The above structure uses a self-attention mechanism because the queries involve
  the words themselves, and the weight matrix $U_Q \in \mathbb{R}^{b \times b}$ and the intercept vector
  $u_Q \in \mathbb{R}^b$ are learned with gradient descent. Concerning the key $K \in \mathbb{R}^{b \times q}$
  one often chooses another self-attention mechanism by choosing a (non-linear)
  function $K = K(w_1, \ldots, w_T)$ to infer optimal keys.

• These attention layers are also the building blocks of transformer models.
  Transformer models use attention layers (10.18) of dimension $W^* \in \mathbb{R}^{b \times T}$ and
  skip connections to transform the input

  $$W = \left(e(w_1), \ldots, e(w_T)\right) \in \mathbb{R}^{b \times T} ~\mapsto~ \frac{W + W^*}{2} \in \mathbb{R}^{b \times T}. \qquad (10.19)$$

  Stacking multiple of these layers (10.19) transforms the original input $W$ by
  weighing the important information in feature $W$ for the prediction task at hand.
  Compared to LSTM layers this no longer sequentially screens the text but it
  directly acts on the part of the text that seems important.

• The attention mechanism is applied to a matrix $(e(w_1), \ldots, e(w_T))^\top \in \mathbb{R}^{T \times b}$
  which presents a numerical encoding of the sentence $(w_1, \ldots, w_T)^\top \in \mathcal{W}_0^T$.
  Kuo-Richman [231] propose to apply this attention mechanism more generally
  to categorical feature components. Assume that we have $T$ categorical feature
  components $x_1, \ldots, x_T$; after embedding them into $b$-dimensional Euclidean
  spaces we receive a representation $(e(x_1), \ldots, e(x_T))^\top \in \mathbb{R}^{T \times b}$, see (7.31).
  Naturally, this can now be further processed by putting different attention on
  the components of this embedding exactly using an attention layer (10.18),
  alternatively we can use transformer layers (10.19).
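
The multi-dimensional attention layer (10.18) differs from (10.17) only in that the key becomes a matrix and the softmax is taken column by column. The base-R sketch below illustrates this under stated assumptions: the embeddings, the queries and the keys are random placeholders, and `softmax_cols` is a helper written here for illustration, not a function of any package.

```r
# Minimal base-R sketch of the attention layer (10.18) with q attention neurons.
set.seed(200)
n_words <- 5; b <- 3; q <- 2

E <- matrix(rnorm(n_words * b), nrow = n_words)        # rows e(w_t), (T x b)
Q <- tanh(matrix(rnorm(n_words * b), nrow = n_words))  # stand-in for the queries (10.15)
K <- matrix(rnorm(b * q), nrow = b)                    # keys k_1, ..., k_q as columns

# column-wise softmax: every column of the result sums to 1
softmax_cols <- function(S) {
  S <- sweep(S, 2, apply(S, 2, max), "-")   # stabilize before exponentiating
  expS <- exp(S)
  sweep(expS, 2, colSums(expS), "/")
}

A      <- softmax_cols(Q %*% K)   # attention weight matrix, (T x q)
W_star <- t(E) %*% A              # encoding W* of dimension (b x q)

print(round(A, 4))
print(round(W_star, 4))
```

Column $j$ of `W_star` is exactly the single-key encoding (10.17) evaluated with key $k_j$, which is the sense in which (10.18) repeats the weighting in $q$ different directions.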


Example 10.6 We revisit the hazard type prediction example of Sect. 10.3. We
select the b = 10 word2vec embedding (using negative sampling) and the
pre-trained GloVe embedding of Table 10.1. These embeddings are then further
processed by applying the attention mechanism (10.15)-(10.17) on the embeddings
using one single attention neuron. Listing 10.9 gives the corresponding implementation.
On line 9 we have the query (10.15), on lines 10-13 the key and the attention
weights (10.16), and on line 15 the encodings (10.17). We then process these
encodings through a FN network of depth d = 2, and we use the softmax output
activation to receive the categorical probabilities. Note that we keep the learned
word embeddings e(w) as non-trainable on line 5 of Listing 10.9.

Table 10.3 gives the results, and Fig. 10.9 shows the confusion matrix. We conclude
that the results are rather similar; this attention mechanism seems to work quite well,
and with fewer parameters here. ■


Listing 10.9 R code for the hazard type prediction using an attention layer with q = 1

input = layer_input(shape = list(T), name = "input")
#
word2vec = input %>%
  layer_embedding(input_dim = W+1, output_dim = b, input_length = T,
                  weights = list(wordEmb), trainable = FALSE)
# (no layer_flatten() here: time_distributed() below needs the (T, b) tensor)
#
attention = word2vec %>%
  time_distributed(layer_dense(units = b, activation = 'tanh')) %>%   # queries (10.15)
  time_distributed(layer_dense(units = 1, activation = 'linear',
                               use_bias = FALSE)) %>%                  # scores <k, q_t>
  layer_flatten() %>%
  layer_dense(units = T, activation = 'softmax', weights = list(diag(T)), use_bias = FALSE, trainable = FALSE)
#
response = list(attention, word2vec) %>% layer_dot(axes = 1) %>%      # encodings (10.17)
  layer_dense(units = 20, activation = 'tanh') %>%
  layer_dense(units = 15, activation = 'tanh') %>%
  layer_dense(units = 9, activation = 'softmax')
#
model = keras_model(inputs = c(input), outputs = c(response))


Table 10.3 Hazard prediction results summarized in deviance losses and misclassification rates

                                     | Deviance loss | Misclassification rate
word2vec negative sampling, b = 10   | 0.0912        | 13.7%
word2vec attention, b = 10           | 0.0784        | 12.0%
FN GloVe using all words, b = 50     | 0.0802        | 11.7%
GloVe attention, b = 50              | 0.0824        | 12.6%


[Figure 10.9: two panels titled "confusion matrix with embeddings b=10" and "confusion matrix with embeddings b=50"; classes Fire, Light., Hail, Wind, Wat.W, Wat.NW, Vehicle, Vand., Misc]

Fig. 10.9 Confusion matrices of the hazard type prediction (lhs) using an attention layer on the
word2vec embeddings with b = 10, and (rhs) using an attention layer on the pre-trained GloVe
embeddings with b = 50; columns show the observations and rows show the predictions
Chapter 11
Selected Topics in Deep Learning


11.1 Deep Learning Under Model Uncertainty 


We revisit claim size modeling in this section. Claim size modeling is challenging 
because often there is no (simple) off-the-shelf distribution that allows one to 
appropriately describe all claim size observations. E.g., the main body of the claim 
size data may look like gamma distributed, and, at the same time, large claims seem 
to be more heavy-tailed (contradicting a gamma model assumption). Moreover, 
different product and claim types may lead to multi-modality in the claim size 
densities. In Sects. 5.3.7 and 5.3.8 we have explored a gamma and an inverse 
Gaussian GLM to model a motorcycle claims data set. In that example, the results 
have been satisfactory because this motorcycle data is neither multi-modal nor does 
it have heavy tails. These two GLM approaches have been based on the EDF (2.14),
modeling the mean $\boldsymbol{x} \mapsto \mu(\boldsymbol{x})$ with a regression function and assuming a constant
dispersion parameter $\varphi > 0$. There are two natural ways to extend this approach.
One considers a double GLM with a dispersion submodel $\boldsymbol{x} \mapsto \varphi(\boldsymbol{x})$, see Sect. 5.5,
the other explores multi-parameter extensions like the generalized inverse Gaussian
model, which is a $k = 3$ vector-valued EF, see (2.10), or the GB2 family that
involves 4 parameters, see (5.79). These extensions provide more complexity, also in
MLE. In this section, we are not going to consider multi-parameter extensions, but
in a first step we aim at robustifying (mean) parameter estimation within the EDF.
In a second step we are going to analyze the resulting dispersion $\varphi(\boldsymbol{x})$. For these
steps, we perform representation learning and parameter estimation under model 
uncertainty by simultaneously considering multiple models from Tweedie’s family. 
These considerations are closely related to Tweedie’s forecast dominance given in 
Definition 4.22. 
We emphasize that we remain within a single distribution function choice in this
section, i.e., we neither consider mixture distributions nor composite models in this 
section. Mixture density networks are going to be considered in Sect. 11.6, below, 
and a composite model approach is studied in Sect. 11.3, below. These mixture 
density networks and composite models allow us to model the body and the tail 
of the data with different distribution functions by either mixing or concatenating 
suitable distributions. 


11.1.1 Recap: Tweedie’s Family 


Tweedie’s family with power variance function $V(\mu) = \mu^p$, $p \ge 2$, provides us
with a rich model class for claim size modeling if the claim sizes are strictly positive,
a.s., and extending to $p \in (1, 2)$ allows us to model claims with a positive point mass
in 0. This class of distribution functions contains the gamma case ($p = 2$) and the
inverse Gaussian case ($p = 3$). In general, $p > 2$ provides us with positive stable
generated distributions and $p \in (1, 2)$ gives Tweedie’s CP models, see Table 2.1.
Tweedie’s family has cumulant function for $p > 1$

$$\kappa(\theta) = \kappa_p(\theta) = \begin{cases} \dfrac{1}{2-p} \left((1-p)\theta\right)^{\frac{2-p}{1-p}} & \text{for } p > 1 \text{ and } p \neq 2, \\[2mm] -\log(-\theta) & \text{for } p = 2, \end{cases} \qquad (11.1)$$

on the effective domain $\theta \in \Theta = (-\infty, 0)$ for $p \in (1, 2]$, and $\theta \in \Theta = (-\infty, 0]$
for $p > 2$. The mean and the power variance function are for $p > 1$ given by

$$\theta \mapsto \mu = \kappa_p'(\theta) = \left((1-p)\theta\right)^{\frac{1}{1-p}} \qquad \text{and} \qquad \mu \mapsto V(\mu) = \mu^p.$$

The unit deviance takes the following form for $p > 1$ and $p \neq 2$, see (4.18),

$$\mathfrak{d}_p(y, \mu) = 2 \left( y\, \frac{y^{1-p} - \mu^{1-p}}{1-p} - \frac{y^{2-p} - \mu^{2-p}}{2-p} \right) \ge 0, \qquad (11.2)$$

and in the gamma case $p = 2$ we have, see Table 4.1,

$$\mathfrak{d}_2(y, \mu) = 2 \left( \frac{y}{\mu} - 1 + \log\left(\frac{\mu}{y}\right) \right) \ge 0. \qquad (11.3)$$


[Figure 11.1: two panels titled "unit deviances of power variance examples"; curves for Gauss p=0, gamma p=2, case p=2.5, inverse Gauss p=3 and case p=3.5; vertical axis: unit deviance; horizontal axis: data y (lhs) and mean mu (rhs)]

Fig. 11.1 (lhs) Unit deviances $y \mapsto \mathfrak{d}_p(y, \mu) \ge 0$ for fixed mean $\mu = 2$ and (rhs) unit
deviances $\mu \mapsto \mathfrak{d}_p(y, \mu) \ge 0$ for fixed observation $y = 2$ for power variance parameters
$p \in \{0, 2, 2.5, 3, 3.5\}$

Figure 11.1 (lhs) shows the unit deviances $y \mapsto \mathfrak{d}_p(y, \mu)$ for fixed mean parameter
$\mu = 2$ and power variance parameters $p \in \{0, 2, 2.5, 3, 3.5\}$; the case $p = 0$
corresponds to the symmetric Gaussian case $\mathfrak{d}_0(y, \mu) = (y - \mu)^2$. We observe
that with an increasing power variance parameter $p$ large claims $Y = y$ receive a
smaller loss punishment (if we interpret the unit deviance as a loss function). This
is the situation where we have a fixed mean $\mu$ and where we assess claim sizes
$Y = y$ relative to this mean. For estimation purposes we have fixed observations
$Y = y$ and we study the sensitivities in $\mu$. Note that, in general, the unit deviances
$\mathfrak{d}_p(y, \mu)$ are not symmetric in $y$ and $\mu$. This second case is shown in Fig. 11.1 (rhs),
and the general behavior in $p$ is similar. As a result, by selecting different
hyper-parameters $p > 1$, we can control the influence of large (and small) claims on
parameter estimation, because the unit deviances $\mathfrak{d}_p(y, \cdot)$ have different slopes for
different $p$'s. Basically, the choice of the loss function (unit deviance) determines
the choice of the underlying distributional model, which then assesses the claim
observations $Y = y$ according to their sizes and how these sizes match the model
assumptions made.
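
The behavior described for the left-hand panel can be reproduced numerically from (11.2) and (11.3). The following base-R sketch is an illustration only; the function name `tweedie_unit_deviance` and the chosen evaluation points are assumptions made here, and the function is restricted to $y, \mu > 0$ and $p \neq 1$.

```r
# Unit deviance of Tweedie's family: (11.3) for p = 2, (11.2) otherwise.
# For p = 0 the general formula reduces to the Gaussian square (y - mu)^2.
tweedie_unit_deviance <- function(y, mu, p) {
  stopifnot(all(y > 0), all(mu > 0), p != 1)
  if (p == 2) {
    2 * (y / mu - 1 + log(mu / y))
  } else {
    2 * (y * (y^(1 - p) - mu^(1 - p)) / (1 - p) -
           (y^(2 - p) - mu^(2 - p)) / (2 - p))
  }
}

mu     <- 2                      # fixed mean as in Fig. 11.1 (lhs)
y      <- c(0.5, 2, 10)          # a small, a matching and a large claim
p_grid <- c(0, 2, 2.5, 3, 3.5)   # power variance parameters of Fig. 11.1

dev <- sapply(p_grid, function(p) tweedie_unit_deviance(y, mu, p))
dimnames(dev) <- list(paste0("y=", y), paste0("p=", p_grid))
print(round(dev, 4))
```

In the printed matrix the row for the large claim $y = 10$ decreases from left to right, i.e., a larger $p$ punishes the large claim less, and the row for $y = \mu = 2$ is zero for every $p$.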

In Lemma 2.22 we have seen that the unit deviances $\mathfrak{d}_p(y, \mu) \ge 0$ are zero if and
only if $y = \mu$. The second derivatives given in Lemma 2.22 allow us to consider a
second order Taylor expansion around a minimum $\mu_0 = y_0$

$$\mathfrak{d}_p(y_0 + \epsilon y, \mu_0 + \epsilon \mu) = \frac{\epsilon^2}{\mu_0^p} (y - \mu)^2 + o(\epsilon^2) \qquad \text{as } \epsilon \to 0.$$

Thus, locally around the minimum the unit deviances behave symmetrically and like
Gaussian squares, but this is only a local approximation around a minimum $\mu_0 = y_0$
as can be seen from Fig. 11.1. I.e., in general, model fitting turns out to be rather
different from the Gaussian square loss if we have small and large claim sizes under
choices $p > 1$.
Remarks 11.1

• Since unit deviances are Bregman divergences, we know that every unit deviance
  gives us a strictly consistent scoring function for the mean functional, see
  Theorem 4.19. Therefore, the specific choice of the power variance parameter $p$
  seems less relevant. However, strict consistency is an asymptotic statement, and
  choosing a unit deviance that matches the property of the data has better finite
  sample properties, i.e., a smaller variance in asymptotic normality; we come back
  to this in Sect. 11.1.4, below.

• A function $(y, \mu) \mapsto \psi(y, \mu)$ is called $b$-homogeneous if there exists $b \in \mathbb{R}$
  such that for all $(y, \mu)$ and all $\lambda > 0$ we have $\psi(\lambda y, \lambda \mu) = \lambda^b \psi(y, \mu)$. Unit
  deviances $\mathfrak{d}_p$ are $b$-homogeneous with $b = 2 - p$. This $b$-homogeneity has
  the nice consequence that the decisions taken are independent of the scale, i.e.,
  we have an invariance under changes of currencies. On the other hand, such a
  scaling influences the estimation of the dispersion parameter, i.e., if we scale the
  observation and the mean with $\lambda$ we have unit deviance

  $$\mathfrak{d}_p(\lambda y, \lambda \mu) = \lambda^{2-p}\, \mathfrak{d}_p(y, \mu). \qquad (11.4)$$

  This influences the dispersion estimation for the cases different from the gamma
  case $p = 2$, see, e.g., saddlepoint approximation (5.60)-(5.62). This also relates
  to the different parametrizations in Sect. 5.3.8 where we study the inverse
  Gaussian model $p = 3$, which has a dispersion $\varphi_i = 1/\alpha_i$ in the reproductive
  form and $\varphi_i = 1/\alpha_i^2$ in parametrization (5.51).

• We only consider power variance parameters $p > 1$ in this section for
  non-negative claim size modeling. Technically, this analysis could be extended to
  $p \in \{0, 1\}$. We do not consider the Gaussian case $p = 0$ to exclude negative
  claims, and we do not consider the Poisson case $p = 1$ because this is used for
  claim counts modeling.
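
The homogeneity property (11.4) is easy to check numerically. The sketch below is an illustration under assumptions made here: the claim, the mean and the scaling factor `lambda` (thought of as a currency conversion rate) are arbitrary numbers, and `unit_deviance` is a helper that implements (11.2) and (11.3).

```r
# Numerical check of the (2 - p)-homogeneity (11.4) of Tweedie's unit deviances.
unit_deviance <- function(y, mu, p) {
  if (p == 2) {
    2 * (y / mu - 1 + log(mu / y))                       # gamma case (11.3)
  } else {
    2 * (y * (y^(1 - p) - mu^(1 - p)) / (1 - p) -
           (y^(2 - p) - mu^(2 - p)) / (2 - p))           # general case (11.2)
  }
}

y <- 1500; mu <- 2000   # an observation and a mean in the original currency
lambda <- 0.92          # hypothetical currency conversion factor

check <- t(sapply(c(2, 2.5, 3), function(p) {
  c(p      = p,
    scaled = unit_deviance(lambda * y, lambda * mu, p),   # left-hand side of (11.4)
    factor = lambda^(2 - p) * unit_deviance(y, mu, p))    # right-hand side of (11.4)
}))
print(check)
```

For $p = 2$ the factor $\lambda^{2-p}$ equals 1, so the gamma unit deviance is unchanged by the rescaling, whereas for $p = 2.5$ and $p = 3$ the deviance, and with it a deviance-based dispersion estimate, changes with the currency.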


We recall that unit deviances of the EDF are equal to twice the corresponding
KL divergences, which in turn are special cases of Bregman divergences. From
Theorem 4.19 we know that Bregman divergences $D_\psi$ are the only strictly
consistent loss/scoring functions for mean estimation.


Lemma 11.2 Choose $p > 1$. The scaled unit deviance $\mathfrak{d}_p(y, \mu)/2$ is a Bregman
divergence $D_{\psi_p}(y, \mu)$ on $\mathbb{R}_+ \times \mathbb{R}_+$ with strictly decreasing and strictly convex
function on $\mathbb{R}_+$

$$\psi_p(y) = y\, h_p(y) - \kappa_p(h_p(y)) = \begin{cases} \dfrac{1}{(2-p)(1-p)}\, y^{2-p} & \text{for } p > 1 \text{ and } p \neq 2, \\[2mm] -1 - \log(y) & \text{for } p = 2, \end{cases}$$

for canonical link $h_p(y) = (\kappa_p')^{-1}(y) = y^{1-p}/(1-p)$.


Proof of Lemma 11.2 The Bregman divergence property follows from (2.29). For
$p > 1$ and $y > 0$ we have the strictly decreasing property

$$\psi_p'(y) = h_p(y) = y^{1-p}/(1-p) < 0.$$

The second derivative is $\psi_p''(y) = h_p'(y) = y^{-p} = 1/V(y) > 0$ which provides the
strict convexity. □


In the Gaussian case we have $\psi_0(y) = y^2/2$, and $\psi_0'(y) > 0$ on $\mathbb{R}_+$ implies
that this is a strictly increasing convex function for positive claims $y > 0$. This is
different to Lemma 11.2.
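
Lemma 11.2 can be verified numerically by comparing $\mathfrak{d}_p(y, \mu)/2$ with the Bregman divergence $D_{\psi_p}(y, \mu) = \psi_p(y) - \psi_p(\mu) - \psi_p'(\mu)(y - \mu)$. The following base-R sketch does this on a small grid; the function names and the grid are choices made here for illustration.

```r
# Numerical check of Lemma 11.2: d_p(y, mu) / 2 = D_{psi_p}(y, mu).
psi_p <- function(y, p) {
  if (p == 2) -1 - log(y) else y^(2 - p) / ((2 - p) * (1 - p))
}
h_p <- function(y, p) y^(1 - p) / (1 - p)   # psi_p'(y), the canonical link

bregman <- function(y, mu, p) {
  psi_p(y, p) - psi_p(mu, p) - h_p(mu, p) * (y - mu)
}

unit_deviance <- function(y, mu, p) {
  if (p == 2) {
    2 * (y / mu - 1 + log(mu / y))
  } else {
    2 * (y * (y^(1 - p) - mu^(1 - p)) / (1 - p) -
           (y^(2 - p) - mu^(2 - p)) / (2 - p))
  }
}

grid <- expand.grid(y = c(0.5, 2, 10), mu = c(1, 2, 5), p = c(1.5, 2, 2.5, 3))
gap <- mapply(function(y, mu, p) unit_deviance(y, mu, p) / 2 - bregman(y, mu, p),
              grid$y, grid$mu, grid$p)
print(max(abs(gap)))   # should be zero up to floating point error
```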

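Assume we have independent observations $(Y_i, \boldsymbol{x}_i)$ following the same
Tweedie’s distribution, and with means given by $\mu_\vartheta(\boldsymbol{x}_i)$ for some parameter $\vartheta$.
The M-estimator of $\vartheta$ using this Bregman divergence is given by

$$\widehat{\vartheta} = \underset{\vartheta}{\arg\max}~ \ell_{Y}(\vartheta) = \underset{\vartheta}{\arg\min}~ \sum_{i=1}^n \frac{v_i}{\varphi}\, D_{\psi_p}\left(Y_i, \mu_\vartheta(\boldsymbol{x}_i)\right).$$

If we turn this M-estimator into a Z-estimator (supposing we have differentiability),
the parameter estimate $\widehat{\vartheta}$ is found as a solution of the score equations

$$\begin{aligned}
0 ~\overset{!}{=}~ & -\nabla_\vartheta \sum_{i=1}^n \frac{v_i}{\varphi}\, D_{\psi_p}\left(Y_i, \mu_\vartheta(\boldsymbol{x}_i)\right) \\
=~ & \sum_{i=1}^n \frac{v_i}{\varphi}\, \psi_p''\left(\mu_\vartheta(\boldsymbol{x}_i)\right) \left(Y_i - \mu_\vartheta(\boldsymbol{x}_i)\right) \nabla_\vartheta \mu_\vartheta(\boldsymbol{x}_i) \\
=~ & \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{Y_i - \mu_\vartheta(\boldsymbol{x}_i)}{V\left(\mu_\vartheta(\boldsymbol{x}_i)\right)}\, \nabla_\vartheta \mu_\vartheta(\boldsymbol{x}_i) \qquad (11.5) \\
=~ & \sum_{i=1}^n \frac{v_i}{\varphi}\, \frac{Y_i - \mu_\vartheta(\boldsymbol{x}_i)}{\mu_\vartheta(\boldsymbol{x}_i)^p}\, \nabla_\vartheta \mu_\vartheta(\boldsymbol{x}_i).
\end{aligned}$$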

In the GLM case this exactly corresponds to (5.9). To determine the Z-estimator
from (11.5), we scale the residuals $Y_i - \mu_i$ inversely proportional to the variances
$V(\mu_i) = \mu_i^p$ of the chosen Tweedie’s distribution. It is a well-known result that
if we scale individual unbiased estimators inversely proportional to their variances,
we receive the unbiased estimator with minimal variance; we come back to this
in (11.16), below. This gives us the intuition behind a specific choice of the power
variance parameter for mean estimation, as the sizes of the variances $\mu_i^p$ scale
(weight) the observed residuals $Y_i - \mu_i$, and balance potential outliers in the
observations correspondingly.
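
The well-known result on inverse-variance weighting can be illustrated with a short calculation. The sketch below assumes three independent unbiased estimators of the same quantity with hypothetical variances `sigma2`; for weights $w_i$ summing to one, the variance of the weighted estimator is $\sum_i w_i^2 \sigma_i^2$.

```r
# Inverse-variance weighting of independent unbiased estimators.
sigma2 <- c(1, 4, 25)   # hypothetical variances of three independent estimators

weighted_var <- function(w, sigma2) sum(w^2 * sigma2)   # variance of sum_i w_i * estimator_i

w_opt <- (1 / sigma2) / sum(1 / sigma2)           # weights proportional to 1 / variance
w_eq  <- rep(1 / length(sigma2), length(sigma2))  # plain average for comparison

var_opt <- weighted_var(w_opt, sigma2)
var_eq  <- weighted_var(w_eq, sigma2)
print(c(inverse_variance = var_opt, equal_weights = var_eq))

# no other weights on the simplex do better than w_opt
set.seed(1)
var_rand <- replicate(1000, {
  w <- runif(length(sigma2))
  weighted_var(w / sum(w), sigma2)
})
print(min(var_rand) >= var_opt)
```

The optimal variance equals $1/\sum_i \sigma_i^{-2}$, which is smaller than the variance of the most precise single estimator.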


11.1.2 Lab: Claim Size Modeling Under Model Uncertainty 


We present a proposal for deep learning under model uncertainty in this section. We 
explain this on an explicit example within Tweedie’s distributions. We emphasize 
that this methodology can be applied in more generality, but it is beneficial here to 
have an explicit example in mind to illustrate the different phenomena. 


Generalized Linear Models 


We analyze a Swiss accident insurance claims data set. This data is illustrated in
Sect. 13.4, and an excerpt of the data is given in Listing 13.7. In total we have
339’500 claims with positive payments. We choose this data set because it ranges
from very small claims of 1 CHF to very large claims, the biggest one exceeding
1’300’000 CHF. These claims are supported by feature information such as the labor
sector, the injury type or the injured body part, see Listing 13.7 and Fig. 13.25. For
our analysis, we partition the data into a learning data set $\mathcal{L}$ and a test data set $\mathcal{T}$.
We do this partition stratified w.r.t. the claim sizes and in a ratio of 9 : 1. This
results in a learning data set $\mathcal{L}$ of size $n = 305’550$ and in a test data set $\mathcal{T}$ of
size $T = 33’950$.
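
The text does not spell out how the stratification is carried out, so the following base-R sketch shows only one possible way to obtain a 9 : 1 partition that is stratified w.r.t. the claim sizes: the claims are ordered by size, cut into consecutive blocks of ten, and one claim per block is drawn at random for the test data. The simulated claims and all variable names are assumptions made here for illustration.

```r
# One possible stratified 9:1 split w.r.t. claim size (illustrative sketch).
set.seed(1)
n_total <- 1000
claims  <- exp(rnorm(n_total, mean = 6, sd = 1.5))   # simulated positive claim sizes

ord   <- order(claims)                     # indices of the claims sorted by size
block <- ceiling(seq_len(n_total) / 10)    # consecutive blocks of 10 in size order

# draw one position per block for the test data
pick <- tapply(seq_len(n_total), block,
               function(ix) ix[sample.int(length(ix), 1)])

test_idx  <- ord[as.integer(pick)]
learn_idx <- setdiff(seq_len(n_total), test_idx)

# learning and test data then have very similar claim size distributions
print(rbind(learn = quantile(claims[learn_idx], c(0.1, 0.5, 0.9)),
            test  = quantile(claims[test_idx],  c(0.1, 0.5, 0.9))))
```

Applied to 339’500 claims, this construction would give 33’950 test claims and 305’550 learning claims, in line with the sizes stated above.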

We consider three Tweedie’s distributions with power variance parameters $p \in
\{2, 2.5, 3\}$; the first one is the gamma model, the last one the inverse Gaussian model,
and the power variance parameter $p = 2.5$ gives a model in between. In a first step
we consider GLMs, which requires feature engineering. We have three categorical
features, one binary feature and two continuous ones. For the categorical and binary
features we use dummy coding, and the continuous features Age and AccQuart
are just included in their raw form. As link function $g$ we choose the log-link which
respects the positivity of the dual mean parameter space $\mathcal{M}$, see Table 2.1, but
this is not the canonical link of the selected models. In the gamma GLM this
leads to a convex minimization problem, but in Tweedie’s GLM with $p = 2.5$
and in the inverse Gaussian GLM we have non-convex minimization problems, see
Example 5.6. Therefore, we initialize Fisher’s scoring method (5.12) in the latter two
GLMs with the solution of the gamma GLM. The gamma and the inverse Gaussian
cases can directly be fitted with the R command glm [307]; for the power variance
parameter case $p = 2.5$ we have coded our own MLE routine using Fisher’s scoring
method.

Table 11.1 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss
(in $10^{-2}$) and inverse Gaussian (IG) loss (in $10^{-3}$)) and AIC values; the losses use unit dispersion
$\varphi = 1$, AIC relies on the MLE of $\varphi$

             | In-sample loss on $\mathcal{L}$     | Out-of-sample loss on $\mathcal{T}$ | AIC
             | d_{p=2} | d_{p=2.5} | d_{p=3} | d_{p=2} | d_{p=2.5} | d_{p=3} | value
Null model   | 3.0094  | 10.2208   | 4.6979  | 3.0240  | 10.2420   | 4.6931  | 4’707’115 (IG)
Gamma GLM    | 2.0695  | 7.7127    | 3.9582  | 2.1043  | 7.7852    | 3.9763  | 47414712
p = 2.5 GLM  | 2.0744  | 7.6971    | 3.9433  | 2.1079  | 7.7635    | 3.9580  | 4’648’698
IG GLM       | 2.0865  | 7.7069    | 3.9398  | 2.1191  | 7.7730    | 3.9541  | 4’653’501

Table 11.1 shows the in-sample losses on the learning data $\mathcal{L}$ and the corresponding
out-of-sample losses on the test data $\mathcal{T}$. The fitted GLMs (gamma, power variance
parameter $p = 2.5$ and inverse Gaussian) are always evaluated on all three unit
deviances $\mathfrak{d}_{p=2}(y, \mu)$, $\mathfrak{d}_{p=2.5}(y, \mu)$ and $\mathfrak{d}_{p=3}(y, \mu)$, respectively. We give some
remarks. First, we observe that the in-sample loss is always minimized for the
GLM with the same power variance parameter $p$ as the loss $\mathfrak{d}_p$ studied (2.0695,
7.6971 and 3.9398 in bold face). This result simply states that the parameter
estimates are obtained by minimizing the in-sample loss (or maximizing the
corresponding in-sample log-likelihood). Second, the minimal out-of-sample losses
are also highlighted in bold face. From these results we cannot give any preference
to a single model w.r.t. Tweedie’s forecast dominance, see Definition 4.20. Third,
we calculate the AIC values for all models. The gamma and the inverse Gaussian
cases have a closed-form solution for the normalizing term $a(y; v/\varphi)$ in the EDF
density, and we can directly calculate AIC. The case $p = 2.5$ is more difficult
and we use the saddlepoint approximation of Sect. 5.5.2. Considering AIC we give
preference to Tweedie’s GLM with $p = 2.5$. Note that the AIC values use the
MLE for $\varphi$ which is obtained from a general purpose optimizer, and which uses
the saddlepoint approximation in the power variance case $p = 2.5$. Fourth, under
a constant dispersion parameter $\varphi$, the mean estimation $\widehat{\mu}_i$ can be done without
explicitly specifying $\varphi$ because it cancels in the score equations. In fact, we perform
this mean estimation in the additive form and not in the reproductive form, see (2.13)
and the discussions in Sects. 5.3.7-5.3.8.
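
To make the role of Fisher's scoring method and the cancellation of the dispersion concrete, the following base-R sketch solves the score equations (11.5) for a log-link Tweedie's GLM with unit weights $v_i = 1$. It is not the routine used for the results above: the data is simulated from a gamma model, the function `fisher_scoring` and all other names are assumptions made here, and, as described in the text, the fits for $p = 2.5$ and $p = 3$ are initialized in the gamma solution.

```r
# Fisher's scoring for a log-link Tweedie GLM; the dispersion cancels and is not needed.
set.seed(42)
n <- 2000
x <- runif(n)
X <- cbind(1, x)                                  # design matrix with intercept
beta_true <- c(7, 0.5)
mu_true <- as.vector(exp(X %*% beta_true))
y <- rgamma(n, shape = 2, rate = 2 / mu_true)     # simulated gamma claims with mean mu_true

fisher_scoring <- function(X, y, p, beta, tol = 1e-10, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    mu    <- as.vector(exp(X %*% beta))
    score <- crossprod(X, (y - mu) * mu^(1 - p))  # sum_i (y_i - mu_i) / mu_i^p * mu_i * x_i
    info  <- crossprod(X, X * mu^(2 - p))         # sum_i mu_i^2 / mu_i^p * x_i x_i'
    step  <- as.vector(solve(info, score))
    beta  <- beta + step
    if (max(abs(step)) < tol) break
  }
  beta
}

beta_null  <- c(log(mean(y)), 0)                                  # start in the null model
beta_gamma <- fisher_scoring(X, y, p = 2,   beta = beta_null)
beta_p25   <- fisher_scoring(X, y, p = 2.5, beta = beta_gamma)    # initialized in gamma fit
beta_ig    <- fisher_scoring(X, y, p = 3,   beta = beta_gamma)    # initialized in gamma fit

print(rbind(truth = beta_true, gamma = beta_gamma, p25 = beta_p25, ig = beta_ig))
```

On this simulated data all three estimates are close to the true parameter, which reflects the strict consistency of every unit deviance for the mean functional; they differ in how strongly large and small observations are weighted. For $p = 2$ the result can be compared with `glm(y ~ x, family = Gamma(link = "log"))`.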

Fig. 11.2 Tukey-Anscombe plots showing the deviance residuals against the logged GLM fitted
means $\widehat{\mu}(\boldsymbol{x}_i)$: (lhs) gamma GLM $p = 2$, (middle) power variance case $p = 2.5$, (rhs) inverse
Gaussian GLM $p = 3$; the cyan lines show twice the estimated standard deviation of the deviance
residuals as a function of the size of the logged estimated means $\widehat{\mu}$

Figure 11.2 plots the deviance residuals (for unit dispersion) against the logged
fitted means $\widehat{\mu}(\boldsymbol{x}_i)$ for $p \in \{2, 2.5, 3\}$ for 2’000 randomly selected claims; this
is the Tukey-Anscombe plot. The green line has been obtained by a spline fit
to the deviance residuals as a function of the fitted means $\widehat{\mu}(\boldsymbol{x}_i)$, and the cyan
lines give twice the estimated standard deviation of the deviance residuals as
a function of the fitted means (also obtained from spline fits). This estimated
standard deviation corresponds to the square-rooted deviance dispersion estimate
$\widehat{\varphi}^{\mathrm{D}}$, see (5.30), however, in the additive form because we work with unscaled claim
size observations. A constant dispersion assumption is supported by cyan lines of
roughly constant size. In the gamma case the dispersion seems increasing in the
mean estimate, and in the inverse Gaussian case it is decreasing, thus, the power
variance parameters $p = 2$ and $p = 3$ do not support a constant dispersion in this
example. Only the choice $p = 2.5$ may support a constant dispersion assumption
(because it does not have an obvious trend). This says that the variance should scale
as $V(\mu) = \mu^{2.5}$ as a function of the mean $\mu$, see also (11.5).


Deep FN Networks 


We compare the above GLMs to FN networks of depth d = 3 with (q1, q2, q3) = 
(20, 15, 10) neurons. The categorical features are modeled with embedding layers 
of dimension b = 2. We fit this network architecture with Tweedie’s deviances 
losses having power variance parameters p € {2, 2.5, 3}. Moreover, we use 20% 
of the learning data £ as validation data V to explore the early stopping rule.! To 
reduce the randomness coming from early stopping with different seeds, we average 
the deviance losses over 20 runs (this is not the nagging predictor: we only average 
the deviance losses to have stable conclusions concerning forecast dominance). The 
results are presented in Table 11.2. 


¹ In the standard implementation of SGD with early stopping, the learning and validation data
partition is done non-stratified. If necessary, this can be changed manually.
Table 11.2 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss
(in $10^{-2}$) and inverse Gaussian (IG) loss (in $10^{-3}$)) and average claim amounts; the losses use unit
dispersion $\varphi = 1$ and the network losses are averaged deviance losses over 20 runs with different
seeds

                    In-sample loss on $\mathcal{L}$       Out-of-sample loss on $\mathcal{T}$   Average
                    p = 2     p = 2.5   p = 3       p = 2     p = 2.5   p = 3       claim
Null model          3.0094    10.2208   4.6979      3.0240    10.2420   4.6931      1’774
Gamma GLM           2.0695    7.7127    3.9582      2.1043    7.1852    3.9763      1’701
p = 2.5 GLM         2.0744    7.6971    3.9433      2.1079    7.7635    3.9580      1’652
IG GLM              2.0865    7.7069    3.9398      2.1191    7.7130    3.9541      1’614
Gamma network       1.9738    7.4556    3.8693      2.0543    7.6478    3.9211      1’748
p = 2.5 network     1.9712    7.4128    3.8458      2.0654    7.6551    3.9178      1’739
IG network          1.9977    7.4568    3.8525      2.0762    7.6682    3.9188      1’712


First, we observe that the networks outperform the GLMs, which indicates that the feature
engineering has not been done optimally for the GLMs. Second, in-sample we no longer
receive the lowest deviance loss in the model with the same p. This comes from the
fact that we exercise early stopping; for instance, the gamma in-sample loss of 1.9738
of the gamma network (p = 2) is bigger than the corresponding gamma loss
of 1.9712 from the network with p = 2.5. Third, considering forecast dominance,
preference is given either to the gamma network or to the power variance parameter
p = 2.5. In general, it seems that fitting with higher power variance parameters
leads to less stable results, but this statement needs more analysis. The disadvantage
of this fitting approach is that we independently fit the models with the different
power variance parameters to the observations, and, thus, the learned representations
$z^{(d:1)}(x_i)$ are rather different for different p’s. This makes it difficult to compare
these models. This is exactly the point that we address next.


Robustified Representation Learning 


To deal with the drawback of missing comparability of the network approaches
with different power variance parameters, we can try to learn a representation
that simultaneously fits different models. The implementation of this idea is rather
straightforward in network modeling. We choose the above network of depth d = 3,
which gives us the new (learned) representation $z_i = z^{(d:1)}(x_i)$ in the last FN
layer. The general idea now is that we design multiple outputs for this learned
representation to fit the different distributional models. That is, in the case of
three Tweedie’s loss functions with power variance parameters $p \in \{2, 2.5, 3\}$ we
consider a three-dimensional output mapping

$$
x \mapsto \left(\mu_{p=2}(x), \mu_{p=2.5}(x), \mu_{p=3}(x)\right)^\top
= \left(\exp\langle\beta_2, z^{(d:1)}(x)\rangle, \exp\langle\beta_{2.5}, z^{(d:1)}(x)\rangle, \exp\langle\beta_3, z^{(d:1)}(x)\rangle\right)^\top \in \mathbb{R}_+^3,
\tag{11.6}
$$
for different output parameters $\beta_2, \beta_{2.5}, \beta_3 \in \mathbb{R}^{q_d+1}$. These three expected
responses (11.6) share the network parameters $w = (w_1^{(1)}, \ldots, w_{q_d}^{(d)})$ in the FN
layers, and the network fitting should learn these parameters such that $z_i =
z^{(d:1)}(x_i)$ gives a good representation for all considered loss functions. Choose
positive weights $\eta_p > 0$, and define the combined deviance loss function

$$
\mathfrak{D}\left(Y, (w, \beta_2, \beta_{2.5}, \beta_3)\right)
= \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^n v_i \, d_p\left(Y_i, \mu_p(x_i)\right),
\tag{11.7}
$$

for the given observations $(Y_i, x_i, v_i)$, $1 \le i \le n$. Note that the unit deviances
$d_p$ live on different scales for different p’s. We use the (constant) weights $\eta_p > 0$
to balance these scales so that all power variance parameters p roughly equally
contribute to the total loss, while setting $\varphi_p \equiv 1$ (which can be done for a constant
dispersion). This approach is now fitted to the available learning data $\mathcal{L}$. The
corresponding R code is given in Listing 11.1. Note that the fitting also requires that
we triplicate the observations $(Y_i, Y_i, Y_i)$ so that we can simultaneously evaluate the
three chosen power variance deviance losses, see lines 18–21 of Listing 11.1. We
fit this model to the Swiss accident insurance data, and the results are presented in
Table 11.3 on the lines called ‘multi-out’.


Listing 11.1 FN network with multiple output 


library(keras)
Design  = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
# shared FN layers producing the learned representation
Network = Design %>%
    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
    layer_dense(units=10, activation='tanh', name='FNLayer3')
# one output per power variance parameter (log-link)
Output1 = Network %>%
    layer_dense(units=1, activation='exponential', name='Output1')
#
Output2 = Network %>%
    layer_dense(units=1, activation='exponential', name='Output2')
#
Output3 = Network %>%
    layer_dense(units=1, activation='exponential', name='Output3')
# three outputs on the common representation
model = keras_model(inputs = c(Design), outputs = c(Output1, Output2, Output3))
# one deviance loss per output, weighted by eta1, eta2, eta3
model %>% compile(loss = list(loss1, loss2, loss3),
                  loss_weights=list(eta1, eta2, eta3), optimizer = 'nadam')
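The weights $\eta_p$ of the combined loss (11.7) are not specified further in the text. The following base R sketch shows one possible balancing rule, which is purely our assumption: we set $\eta_p$ to the reciprocal of the average null-model deviance, so that every p contributes exactly 1 to the combined loss under the null model. The helper names, the toy gamma claims and the use of averages instead of sums over i are hypothetical choices for illustration.

```r
# Tweedie unit deviance for p >= 2 (hypothetical helper, standard formula assumed)
tweedie_dev <- function(y, mu, p) {
  if (p == 2) {
    2 * (y / mu - 1 - log(y / mu))
  } else {
    2 * (y^(2 - p) / ((1 - p) * (2 - p)) - y * mu^(1 - p) / (1 - p) + mu^(2 - p) / (2 - p))
  }
}

set.seed(1)
Y      <- rgamma(1000, shape = 2, rate = 2 / 1500)  # toy claims with mean 1500 (assumption)
p_grid <- c(2, 2.5, 3)

# null-model predictor and its average deviance for each p
mu0       <- rep(mean(Y), length(Y))
null_loss <- sapply(p_grid, function(p) mean(tweedie_dev(Y, mu0, p)))

# balancing rule (assumption): eta_p = 1 / average null-model deviance
eta <- 1 / null_loss

# combined loss in the spirit of (11.7) with phi_p = 1 and v_i = 1, averaged over i
combined_loss <- function(mu_list) {
  per_p <- mapply(function(mu, p) mean(tweedie_dev(Y, mu, p)), mu_list, p_grid)
  sum(eta * per_p)
}
print(eta)
print(combined_loss(list(mu0, mu0, mu0)))  # each p contributes 1 under the null model
```

In a keras fit, such constants would play the role of `eta1`, `eta2`, `eta3` in the `loss_weights` argument of Listing 11.1.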


This simultaneous representation learning across different loss functions leads to 
more stability in the results between the different loss function choices, i.e., there 
is less variability between the losses of the different outputs compared to fitting the 
three different models independently. The predictive performance seems slightly 
better in this robustified vs. the independent case (see bold face out-of-sample 
figures). The similarity of the results across the different loss functions (using the 
Table 11.3 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss
(in $10^{-2}$) and inverse Gaussian (IG) loss (in $10^{-3}$)) and average claim amounts; the losses use unit
dispersion $\varphi = 1$ and the network losses are averaged deviance losses over 20 runs with different
seeds

                                Out-of-sample IG loss     Average
                                (p = 3) on $\mathcal{T}$            claim
Null model                      4.6931                    1’774
Gamma network                   3.9211                    1’748
p = 2.5 network                 3.9178                    1’739
IG network                      3.9188                    1’712
Gamma multi-output (11.6)       3.9146                    1’745
p = 2.5 multi-output (11.6)     3.9139                    1’732
IG multi-output (11.6)          3.9134                    1’705
Multi-loss fitting (11.8)       3.9144                    1’744

[Figure 11.3: two panels, each titled ‘comparison of gamma, p=2.5 and inverse Gauss’]
Fig. 11.3 Ratios $\hat\mu_{p=2}(x_i)/\hat\mu_{p=2.5}(x_i)$ (black color) and $\hat\mu_{p=3}(x_i)/\hat\mu_{p=2.5}(x_i)$ (blue color) of the
three predictors: (lhs) in-sample figures ordered on the x-axis w.r.t. the logged observed claims $Y_i$,
darkgray and cyan lines give spline fits, (rhs) out-of-sample figures ordered on the x-axis w.r.t. the
logged average size of the three predictors


jointly learned representation $z_i$) allows us to directly compare the corresponding
predictors $\hat\mu_p(x_i)$ for the different p’s.

Figure 11.3 compares the three predictors by considering the ratios
$\hat\mu_{p=2}(x_i)/\hat\mu_{p=2.5}(x_i)$ in black color and $\hat\mu_{p=3}(x_i)/\hat\mu_{p=2.5}(x_i)$ in blue color, i.e.,
we divide by the (middle) predictor with power variance parameter p = 2.5.
The figure on the left-hand side shows these ratios in-sample and ordered on
the x-axis w.r.t. the observed claim sizes $Y_i$, and the darkgray and cyan lines
give spline fits to these ratios. The figure on the right-hand side shows these
ratios out-of-sample and ordered on the x-axis w.r.t. the average predictors
$\bar\mu_i = (\hat\mu_{p=2}(x_i) + \hat\mu_{p=2.5}(x_i) + \hat\mu_{p=3}(x_i))/3$. In view of (11.5) we expect that the
models with a smaller power variance parameter p over-fit more to large claims.
From Fig. 11.3 (lhs) we can observe that, indeed, this is the case (see darkgray and cyan
spline fits which bifurcate for large claims). That is, models with a smaller power
variance parameter react more sensitively to large observations $Y_i$. The ratios in
Fig. 11.3 show differences of up to 7% for large claims.


Remark 11.3 The loss function (11.7) can also be interpreted as regularization.
For instance, if we choose $\eta_2 = 1$, and if we assume that this is our preferred
model, then we can regularize this model with further models, and their weights
$\eta_p > 0$ determine the degree of regularization. Thus, in contrast to ridge and
LASSO regularization of Sect. 6.2, regularization does not directly act on the
model parameters, here, but rather on what we learn in terms of the representation
$z_i = z^{(d:1)}(x_i)$.


Using Forecast Dominance to Deal with Model Uncertainty 


In GLMs, the power variance parameter p typically acts as a hyper-parameter, i.e.,
one fits different GLMs for different choices of p. Model selection is then done, e.g.,
by analyzing the Tukey–Anscombe plot, AIC, cross-validation or by studying
out-of-sample forecast dominance. In networks we should not use AIC as we neither
have a parsimonious network parameter nor do we use the MLE. Here, we focus
on forecast dominance for the network predictors (based on the different chosen
power variance parameters). If we are mainly interested in receiving a model that
provides optimal forecast dominance, we should not consider three different outputs
as in (11.7), but rather fit the same output to different loss functions; the required
changes are minimal, see Listing 11.2. Namely, consider one FN network with one
output $\mu(x_i)$, but evaluate this output simultaneously on the different chosen loss
functions

$$
\mathfrak{D}\left(Y, (w, \beta)\right)
= \sum_{p \in \{2, 2.5, 3\}} \frac{\eta_p}{\varphi_p} \sum_{i=1}^n v_i \, d_p\left(Y_i, \mu(x_i)\right).
\tag{11.8}
$$

In contrast to (11.7), we only have one FN network regression function $x_i \mapsto \mu(x_i)$,
here.

We present the results on the last line of Table 11.3, called ‘multi-loss’. In our 
case, this approach is slightly less competitive (out-of-sample), however, it is less 
sensitive to outliers since we need to have a good regression function simultaneously 
for multiple loss functions. Of course, this multiple loss fitting approach is not 
restricted to different power variance parameters. As stated in Theorem 4.19, 
Bregman divergences are the only consistent loss functions for mean estimation, 
and the unit deviances are examples of Bregman divergences. Forecast dominance 
now suggests that we may choose any Bregman divergence as a loss function in 
Listing 11.2 as long as it reflects the expected properties of the model (and of 
Listing 11.2 FN network with a single output for multiple losses


library(keras)
Design  = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
# shared FN layers producing the learned representation
Network = Design %>%
    layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
    layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
    layer_dense(units=10, activation='tanh', name='FNLayer3')
# a single output (log-link)
Output = Network %>%
    layer_dense(units=1, activation='exponential', name='Output')
# the same output is passed three times, once per loss function
model = keras_model(inputs = c(Design), outputs = c(Output, Output, Output))
# one deviance loss per copy of the output, weighted by eta1, eta2, eta3
model %>% compile(loss = list(loss1, loss2, loss3),
                  loss_weights=list(eta1, eta2, eta3), optimizer = 'nadam')


the observed data), otherwise we will receive bad convergence properties, see also 
Sect. 11.1.4, below. For instance, we can robustify the Poisson claim counts model 
by additionally considering the deviance loss of the negative binomial model that 
also assesses over-dispersion. 
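The multi-loss idea (11.8) can be illustrated without keras in the simplest possible setting. The sketch below is a toy example under our own assumptions (simulated gamma claims, an intercept-only "network" $\mu = \exp(b)$, hypothetical helper names, weights chosen as reciprocal null-model deviances). Since every unit deviance is a Bregman divergence, each single loss is minimized by the sample mean in this intercept-only case, and so is their weighted sum; with features, the different p would weight the observations differently.

```r
# Tweedie unit deviance for p >= 2 (hypothetical helper, standard formula assumed)
tweedie_dev <- function(y, mu, p) {
  if (p == 2) {
    2 * (y / mu - 1 - log(y / mu))
  } else {
    2 * (y^(2 - p) / ((1 - p) * (2 - p)) - y * mu^(1 - p) / (1 - p) + mu^(2 - p) / (2 - p))
  }
}

set.seed(2)
Y      <- rgamma(2000, shape = 2, rate = 2 / 1500)  # toy claims (assumption)
p_grid <- c(2, 2.5, 3)
# weights balancing the scales of the three deviances (assumption)
eta    <- 1 / sapply(p_grid, function(p) mean(tweedie_dev(Y, mean(Y), p)))

# multi-loss objective in the spirit of (11.8) for a single output mu = exp(b)
multi_loss <- function(b) {
  mu <- exp(b)
  sum(eta * sapply(p_grid, function(p) mean(tweedie_dev(Y, mu, p))))
}

opt    <- optimize(multi_loss, interval = log(range(Y)))
mu_hat <- exp(opt$minimum)
print(c(multi_loss_fit = mu_hat, sample_mean = mean(Y)))
```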


Nagging Predictor 


The loss figures in Table 11.3 are averaged deviance losses over 20 different runs of 
the gradient descent algorithm with different seeds (to receive stable results). Rather 
than averaging over the losses, we should improve the models by averaging over the 
predictors and, then, calculate the losses on these averaged predictors; this is exactly 
the proposal of the nagging predictor (7.44). We calculate the nagging predictor of 
the models that are simultaneously fit to the different loss functions (lines ‘multi- 
output’ and ‘multi-loss’ of Table 11.3). The resulting nagging predictors are reported 
in Table 11.4. This table shows that we give a clear preference to the nagging 
predictors. The simultaneous loss fitting (11.8) gives the best out-of-sample results 
for the nagging predictor, see the last line of Table 11.4. 
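The difference between averaging losses and averaging predictors can be sketched in base R. This is a toy illustration only: the 20 network fits are replaced by a stand-in (the true mean perturbed by independent multiplicative noise), which is our assumption and not the fitting procedure of the text, and all names are hypothetical.

```r
set.seed(3)
n <- 500   # number of policies (toy value)
M <- 20    # number of runs with different seeds

mu_true <- exp(runif(n, 6, 8))
Y       <- rgamma(n, shape = 2, rate = 2 / mu_true)   # toy gamma claims with mean mu_true

# stand-in for M network fits (assumption): true mean times independent noise
mu_runs <- sapply(1:M, function(m) mu_true * exp(rnorm(n, sd = 0.2)))  # n x M matrix

gamma_dev <- function(y, mu) 2 * (y / mu - 1 - log(y / mu))

# (a) average the deviance losses over the M runs
loss_per_run  <- apply(mu_runs, 2, function(mu) mean(gamma_dev(Y, mu)))
avg_of_losses <- mean(loss_per_run)

# (b) nagging predictor: average the predictors first, then evaluate the loss once
mu_nagging   <- rowMeans(mu_runs)
loss_nagging <- mean(gamma_dev(Y, mu_nagging))

print(c(avg_of_losses = avg_of_losses, loss_nagging = loss_nagging))
```

In this toy setting the averaged predictor has the smaller loss because averaging removes much of the run-to-run noise; this is not a general guarantee for arbitrary predictors.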

Figure 11.4 shows the Tukey–Anscombe plot of the multi-loss nagging predictor for
the different deviance losses (for unit dispersion). Again, the case p = 2.5 is closest
to having a constant dispersion, and the other cases will require dispersion modeling
$\varphi(x)$.

Figure 11.5 shows the empirical auto-calibration property of the multi-loss nagging
predictor. This auto-calibration property is calculated as in Listing 7.8. We observe
that the auto-calibration property holds rather accurately. Only for claim predictors
$\hat\mu(x_i)$ above 10’000 CHF (vertical dotted line in Fig. 11.5) the fitted means
under-estimate the observed average claim sizes. This affects (only) 1.7% of all claims and
it could be corrected as described in Example 7.19.
Table 11.4 In-sample and out-of-sample losses (gamma loss, power variance case p = 2.5 loss
(in $10^{-2}$) and inverse Gaussian (IG) loss (in $10^{-3}$)) and average claim amounts; the losses use unit
dispersion $\varphi = 1$

                                Out-of-sample IG loss     Average
                                (p = 3) on $\mathcal{T}$            claim
Null model                      4.6931                    1’774
Gamma multi-output (11.6)       3.9146                    1’745
p = 2.5 multi-output (11.6)     3.9139                    1’732
IG multi-output (11.6)          3.9134                    1’705
Multi-loss fitting (11.8)       3.9144                    1’744
Gamma multi-out & nagging       3.8864                    1’745
p = 2.5 multi-out & nagging     3.8864                    1’732
IG multi-out & nagging          3.8865                    1’705
Multi-loss with nagging         3.8837                    1’744

[Figure 11.4: three panels titled ‘Tukey-Anscombe plot: gamma’, ‘Tukey-Anscombe plot: p=2.5’ and ‘Tukey-Anscombe plot: inverse Gaussian’]
Fig. 11.4 Tukey–Anscombe plots giving the deviance residuals of the multi-loss nagging predictor
of Table 11.4 for different power variance parameters: (lhs) gamma deviances p = 2, (middle)
power variance deviances p = 2.5, (rhs) inverse Gaussian deviances p = 3; the cyan lines show
twice the estimated standard deviation of the deviance residuals as a function of the size of the
logged estimated means $\hat\mu$


11.1.3 Lab: Deep Dispersion Modeling 


From the Tukey–Anscombe plots in Fig. 11.4 we conclude that the dispersion
requires regression modeling, too, as the dispersion does not seem to be constant
over the whole range of the expected claim sizes. We therefore explore a double FN
network model; in spirit this is similar to the double GLM of Sect. 5.5. We
assume to work within Tweedie’s family with power variance parameters p ≥ 2, and
with unit deviances given by (11.2)–(11.3). The saddlepoint approximation (5.59)
gives us

$$
f(y; \theta, v/\varphi) \approx \left(\frac{2\pi\varphi}{v}\, y^p\right)^{-1/2}
\exp\left\{-\frac{1}{2\varphi/v}\, d_p(y, \mu)\right\},
$$

formulated in the reproductive form for $Y = X/\omega = X\varphi/v$. This requires scaling of
the observations X with the unknown $\varphi$ to receive Y. In Sect. 5.5.4 we have shown
how this problem can be solved. In this section we give a different proposal which
is more robust in network fitting, and which benefits from the b-homogeneity of $d_p$,
see (11.4).

Fig. 11.5 Empirical auto-calibration of network prediction

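We consider the variable transformation $y \mapsto x = y\omega = yv/\varphi$. In the absolutely
continuous case p ≥ 2 this gives us the approximation

$$
\begin{aligned}
f(x; \theta, v/\varphi) &\approx \frac{\varphi}{v}\left(\frac{2\pi\varphi}{v}\,\frac{x^p\varphi^p}{v^p}\right)^{-1/2}
\exp\left\{-\frac{1}{2\varphi/v}\, d_p\left(\frac{x\varphi}{v}, \frac{\mu_p\varphi}{v}\right)\right\} \\
&= \left(\frac{2\pi\varphi^{p-1}}{v^{p-1}}\, x^p\right)^{-1/2}
\exp\left\{-\frac{1}{2\varphi^{p-1}/v^{p-1}}\, d_p(x, \mu_p)\right\},
\end{aligned}
$$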


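with mean $\mu_p = \mu v/\varphi$ of $X = Yv/\varphi$. We set $\phi = -1/\varphi^{p-1} < 0$. This gives us the
approximation

$$
\ell_X(\mu_p, \phi) \approx \frac{v^{p-1} d_p(X, \mu_p)\,\phi - \left(-\log(-\phi)\right)}{2}
- \frac{1}{2}\log\left(\frac{2\pi}{v^{p-1}}\, X^p\right).
\tag{11.9}
$$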
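The two steps behind this approximation can be spelled out as follows. This is a sketch under the assumption that the b-homogeneity (11.4) reads $d_p(\omega y, \omega\mu) = \omega^{2-p} d_p(y, \mu)$ for $\omega > 0$, which is the standard homogeneity of Tweedie's unit deviances.

```latex
% Step 1: b-homogeneity with \omega = v/\varphi, i.e. x = \omega y and \mu_p = \omega\mu
d_p\!\left(\frac{x\varphi}{v}, \frac{\mu_p\varphi}{v}\right)
  = \left(\frac{v}{\varphi}\right)^{p-2} d_p(x, \mu_p)
\quad\Longrightarrow\quad
\frac{v}{2\varphi}\, d_p\!\left(\frac{x\varphi}{v}, \frac{\mu_p\varphi}{v}\right)
  = \frac{v^{p-1}}{2\varphi^{p-1}}\, d_p(x, \mu_p).

% Step 2: take logarithms and insert \phi = -1/\varphi^{p-1},
% so that \log \varphi^{p-1} = -\log(-\phi)
\log f(x)
  \approx -\frac{1}{2}\log \varphi^{p-1}
          - \frac{1}{2}\log\!\left(\frac{2\pi}{v^{p-1}}\, x^p\right)
          - \frac{v^{p-1}}{2\varphi^{p-1}}\, d_p(x, \mu_p)
  = \frac{v^{p-1} d_p(x, \mu_p)\,\phi - \left(-\log(-\phi)\right)}{2}
    - \frac{1}{2}\log\!\left(\frac{2\pi}{v^{p-1}}\, x^p\right).
```

The first fraction has the form of a gamma EDF log-likelihood with observation $v^{p-1} d_p(x, \mu_p)$, canonical parameter $\phi$, cumulant function $-\log(-\phi)$ and dispersion 2.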


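For given mean $\mu_p$ we again have a gamma approximation on the right-hand side,
but we scale the dispersion differently. This gives us the approximate first moment

$$
E_\phi\left[\left. v^{p-1} d_p(X, \mu_p) \,\right|\, \mu_p\right]
\approx \kappa_\varphi'(\phi) = -1/\phi = \varphi^{p-1} \overset{\text{def.}}{=} \varphi_p.
$$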
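In the gamma case p = 2 (with v = 1, so that $\varphi_p = \varphi$) the quality of this first-moment approximation can be checked directly, because the mean of the gamma unit deviance is available in closed form through the digamma function. This identity is a standard gamma result and not taken from the text; the function name and the toy values below are hypothetical.

```r
# exact mean of the gamma unit deviance for shape alpha = 1/phi:
# E[d_2(Y, mu)] = 2 * (log(alpha) - digamma(alpha)), independent of mu
exact_mean_dev <- function(phi) {
  alpha <- 1 / phi
  2 * (log(alpha) - digamma(alpha))
}

phi <- c(0.05, 0.1, 0.5, 1, 2)
res <- cbind(phi = phi, exact = exact_mean_dev(phi), ratio = exact_mean_dev(phi) / phi)
print(res)   # ratio close to 1 for small phi, deteriorating for large phi

# Monte Carlo cross-check for phi = 0.1 and mu = 1500 (toy values, assumption)
set.seed(4)
mu <- 1500
Y  <- rgamma(1e5, shape = 1 / 0.1, rate = 1 / (0.1 * mu))
mc <- mean(2 * (Y / mu - 1 - log(Y / mu)))
print(mc)
```

The approximation is accurate for small dispersion and becomes cruder when the dispersion is of order 1 or larger, which is the regime in which saddlepoint approximations are known to be less precise.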


The remainder of this modeling is similar to the residual MLE approach in
Section 5.5.3. Namely, we set up two FN network regression functions

$$
x \mapsto \mu_p(x) \qquad \text{and} \qquad x \mapsto \varphi_p(x) = \kappa_\varphi'(\phi(x)) = -1/\phi(x).
$$
Parameter fitting is achieved by alternating the network parameter fitting of $\mu_p(x)$
and $\varphi_p(x)$, see also Section 5.5.4. We start the iteration by setting the dispersion
constant $\hat\varphi_p^{(0)}(x) \equiv$ const. In this case, the dispersion cancels in the score
equations and the mean $\mu_p(x)$ can be estimated without the explicit knowledge
of the (constant) dispersion parameter $\hat\varphi_p^{(0)}$; this exactly provides the results of the
previous Sect. 11.1.2. Then, we iterate this procedure for t ≥ 1. For given mean
estimate $\hat\mu_p^{(t-1)}(x)$ we receive deviances $v^{p-1} d_p(X, \hat\mu_p^{(t-1)}(x))$, and this allows us to
estimate $\hat\varphi_p^{(t)}(x)$ from the approximate gamma model (11.9), and for given dispersion
parameters $\hat\varphi_p^{(t)}(x)$ we estimate $\hat\mu_p^{(t)}(x)$ from the corresponding Tweedie’s
model for the observation X.
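The alternating scheme can be sketched in base R if we replace the two FN networks by two gamma GLMs with log-link. This swap is our simplification for illustration and not the network approach of this section; the simulated data, the case p = 2 with v = 1, and all variable names are assumptions.

```r
set.seed(42)
n   <- 5000
x   <- runif(n)
mu  <- exp(6 + 1.0 * x)      # true mean (toy choice)
phi <- exp(-2 + 1.5 * x)     # true dispersion, increasing in x (toy choice)
Y   <- rgamma(n, shape = 1 / phi, rate = 1 / (phi * mu))   # E[Y] = mu, Var(Y) = phi * mu^2

phi_hat <- rep(1, n)         # t = 0: constant dispersion
for (t in 0:2) {
  # mean step: gamma GLM weighted by the reciprocal dispersion estimates
  fit_mu <- glm(Y ~ x, family = Gamma(link = "log"), weights = 1 / phi_hat)
  mu_hat <- fitted(fit_mu)
  # dispersion step: gamma GLM with the unit deviances as responses
  dev     <- 2 * (Y / mu_hat - 1 - log(Y / mu_hat))
  fit_phi <- glm(dev ~ x, family = Gamma(link = "log"))
  phi_hat <- fitted(fit_phi)
}
print(coef(fit_mu))    # should be close to (6, 1)
print(coef(fit_phi))   # slope clearly positive, roughly of the size of 1.5
```

With a constant initial dispersion the first mean step is an ordinary gamma GLM, which mirrors the statement that the constant dispersion cancels in the score equations.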


Example 11.4 We revisit the Swiss accident insurance data example of Sect. 11.1.2,
and we use the robustified representation learning approach (11.7) that simultaneously
fits Tweedie’s models for the power variance parameters p = 2, 2.5, 3.
The initial calibration step is done for constant dispersions $\hat\varphi_p^{(0)}(x) \equiv$ const, and
it provides us with the estimated means $\hat\mu_p^{(0)}(x)$ as illustrated in Fig. 11.3. For
stability reasons we choose the nagging predictor averaging over 20 different SGD
runs with 20 different seeds. These estimated means $\hat\mu_p^{(0)}(x)$ give us the deviances
$v^{p-1} d_p(X, \hat\mu_p^{(0)}(x))$.

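Using these deviances allows us to alternate the dispersion and mean estimation
for t ≥ 1. For given means $\hat\mu_p^{(t-1)}(x)$, p = 2, 2.5, 3, we set up a deep FN network
$x \mapsto z^{(d:1)}(x)$ that allows for a robustified deep dispersion learning $\varphi_p(x)$, for
p = 2, 2.5, 3. Under the log-link choice we consider the regression function with
multiple outputs

$$
x \mapsto \left(\varphi_{p=2}(x), \varphi_{p=2.5}(x), \varphi_{p=3}(x)\right)^\top
= \left(\exp\langle\alpha_2, z^{(d:1)}(x)\rangle, \exp\langle\alpha_{2.5}, z^{(d:1)}(x)\rangle, \exp\langle\alpha_3, z^{(d:1)}(x)\rangle\right)^\top \in \mathbb{R}_+^3,
\tag{11.10}
$$

for different output parameters $\alpha_2, \alpha_{2.5}, \alpha_3 \in \mathbb{R}^{q_d+1}$. These three expected
responses (11.10) share the common network parameter $\widetilde{w} = (\widetilde{w}_1^{(1)}, \ldots, \widetilde{w}_{q_d}^{(d)})$ in
the FN layers of $z^{(d:1)}$. The network fitting learns these parameters simultaneously
for the different power variance parameters. Choose positive weights $\widetilde\eta_p > 0$, and
define the combined deviance loss function (based on the gamma model $d_2$ and
having dispersion parameter 2)

$$
\widetilde{\mathfrak{D}}\left(d(X, \hat\mu^{(t-1)}), (\widetilde{w}, \alpha_2, \alpha_{2.5}, \alpha_3)\right)
= \sum_{p \in \{2, 2.5, 3\}} \frac{\widetilde\eta_p}{2} \sum_{i=1}^n
d_2\left(v_i^{p-1} d_p\left(X_i, \hat\mu_p^{(t-1)}(x_i)\right), \varphi_p(x_i)\right),
\tag{11.11}
$$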
where $X = (X_1, \ldots, X_n)$ collects the unscaled observations $X_i = Y_i v_i/\varphi_i$. Thus,
for all power variance parameters p = 2, 2.5, 3 we fit a gamma model $d_2(\cdot, \cdot)/2$
to the observed deviances (observations) $v_i^{p-1} d_p(X_i, \hat\mu_p^{(t-1)}(x_i))$, providing us with
the estimated dispersions $\hat\varphi_p^{(t)}(x_i)$. This fitting step is received by the R code
of Listing 11.1, where the losses in the compile step are all given by gamma deviance
losses (11.11) and the deviances $v_i^{p-1} d_p(X_i, \hat\mu_p^{(t-1)}(x_i))$ play the role of the responses
(observations).

In the next step we update the mean estimates $\hat\mu_p^{(t)}(x_i)$, given the estimated
dispersions $\hat\varphi_p^{(t)}(x_i)$ from the previous step. This requires that we optimize the
expected responses (11.6) for given heterogeneous dispersion parameters. We
therefore consider the loss function for positive weights $\eta_p > 0$, see (11.7),

$$
\mathfrak{D}\left(X, \hat\varphi^{(t)}, (w, \beta_2, \beta_{2.5}, \beta_3)\right)
= \sum_{p \in \{2, 2.5, 3\}} \eta_p \sum_{i=1}^n \frac{v_i^{p-1}}{\hat\varphi_p^{(t)}(x_i)}\, d_p\left(X_i, \mu_p(x_i)\right).
\tag{11.12}
$$


We fit this model by iterating this approach for t ≥ 1: we start from the predictors
of Sect. 11.1.2 providing us with the first mean estimates $\hat\mu_p^{(0)}(x_i)$. Based on these
mean estimates we iterate this robustified estimation of $\hat\varphi_p^{(t)}(x_i)$ and $\hat\mu_p^{(t)}(x_i)$. We
give some remarks:


1. We use the robustified versions (11.11) and (11.12), respectively, where we
simultaneously fit all power variance parameters p = 2, 2.5, 3 on the commonly
learned representations $z_i = z^{(d:1)}(x_i)$ in the last FN layer of the mean and the
dispersion network, respectively.

2. For both FN networks of mean $\mu$ and dispersion $\varphi$ modeling we use the same
network architecture of depth d = 3 having $(q_1, q_2, q_3)$ = (20, 15, 10) neurons
in the FN layers, the hyperbolic tangent activation function, and the log-link
for the output. These two networks only differ in their network parameters
$(w, \beta_2, \beta_{2.5}, \beta_3)$ and $(\widetilde{w}, \alpha_2, \alpha_{2.5}, \alpha_3)$, respectively.

3. For fitting we use the nadam version of SGD. For the early stopping we use a
training data $\mathcal{U}$ to validation data $\mathcal{V}$ split of 8 : 2.

4. To ensure consistency within the individual SGD runs across t ≥ 1, we use the
learned network parameter of loop t as initial value for loop t + 1. This ensures
monotonicity across the iterations in the log-likelihood and the loss function,
respectively, up to the fact that the random mini-batches in SGD may distort this
monotonicity.

5. To reduce the elements of randomness in SGD fitting we run this iteration
procedure 20 times with different seeds, and we output the nagging predictors
for $\hat\mu_p^{(t)}(x_i)$ and $\hat\varphi_p^{(t)}(x_i)$ averaged over the 20 runs for every t in Table 11.5.


We iterate this algorithm over two loops, and the results are presented in Table 11.5.
We observe a decrease of $-2\ell_X(\hat\mu_p^{(t)}, \hat\varphi_p^{(t)})$ by iterating the fitting algorithm for t ≥
1. For AIC, we would have to correct twice the negative log-likelihood by twice
Table 11.5 Iteration of mean $\hat\mu_p^{(t)}$ and dispersion $\hat\varphi_p^{(t)}$ estimation for the gamma model p = 2,
the power variance parameter p = 2.5 model and the inverse Gaussian model p = 3: the numbers
correspond to $-2\ell_X(\hat\mu_p^{(t)}, \hat\varphi_p^{(t)})$; the last line corrects the value of the final iterate by
2 · 2 · 812 = 3’248 (twice the number of parameters used in the mean and dispersion FN networks)

                        −2 · log-likelihood
Iteration               Gamma p = 2     Power variance p = 2.5     Inverse Gaussian p = 3
1st iterate             4’722’961       4’635’038                  4’644’869
2nd iterate             4’702’247       4’622’097                  4’617’593
3rd iterate             4’701’234       4’621’123                  4’616’869
4th iterate             4’700’686       4’620’845                  4’616’588
“AIC”                   4’703’978       4’624’137                  4’619’880


the number of MLE estimated parameters. We also adjust here correspondingly, 
though the correction is not justified by any theory, because we do not work with 
the MLE nor do we have a parsimonious model for mean and dispersion estimation. 
Nevertheless, we receive smaller values than in Table 11.1 which supports the use 
of this more complex double FN network model. 

Comparing the three power variance parameter models, we now give preference 
to the inverse Gaussian model, as it has the biggest log-likelihood. Note that we 
directly compare all power variance models as the complexity is equal in all models 
(they only differ in the chosen power variance parameter) and the joint robustified 
fitting applies the same stopping rule to all power variance parameter models. The 
same result is obtained by comparing the out-of-sample log-likelihoods. Note that 
we do not compare the deviance losses, here, because the unit deviances are not 
designed to estimate parameters in vector-valued parameter families; we model 
dispersion as a second parameter. 

Next, we study the estimated dispersions $\hat\varphi_p(x_i)$ as a function of the estimated
means $\hat\mu_p(x_i)$. We fit a spline to $\hat\varphi_p(x_i)$ as a function of $\hat\mu_p(x_i)$, and we receive
estimates that almost perfectly match the cyan lines in Fig. 11.4. This provides
a proof of concept that the dispersion regression model finds the right level of
dispersion as a function of the expected means.

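Using the mean and dispersion estimates, we can calculate the dispersion scaled
deviance residuals

$$
r_i^{D} = \mathrm{sign}\left(X_i - \hat\mu_p(x_i)\right)
\sqrt{v_i^{p-1}\, d_p\left(X_i, \hat\mu_p(x_i)\right) / \hat\varphi_p(x_i)}.
\tag{11.13}
$$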


This then allows us to give the Tukey—Anscombe plots for the three considered 
power variance parameters. 
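The residuals (11.13) are straightforward to compute once mean and dispersion estimates are available. The sketch below uses simulated gamma data with known mean and dispersion in place of the network estimates; the helper names, the toy parameters and the use of the standard Tweedie unit deviance are our assumptions.

```r
# Tweedie unit deviance for p >= 2 (hypothetical helper, standard formula assumed)
tweedie_dev <- function(y, mu, p) {
  if (p == 2) {
    2 * (y / mu - 1 - log(y / mu))
  } else {
    2 * (y^(2 - p) / ((1 - p) * (2 - p)) - y * mu^(1 - p) / (1 - p) + mu^(2 - p) / (2 - p))
  }
}

# dispersion scaled deviance residuals in the spirit of (11.13);
# pmax guards against tiny negative values caused by rounding
dev_resid <- function(X, mu, phi, p, v = 1) {
  sign(X - mu) * sqrt(pmax(v^(p - 1) * tweedie_dev(X, mu, p), 0) / phi)
}

set.seed(5)
n   <- 10000
mu  <- exp(runif(n, 6, 8))   # toy means (assumption)
phi <- 0.2                   # toy dispersion (assumption)
X   <- rgamma(n, shape = 1 / phi, rate = 1 / (phi * mu))

r <- dev_resid(X, mu, phi, p = 2)
print(sd(r))                 # roughly 1 if the dispersion scaling is appropriate
```

A Tukey–Anscombe plot is then obtained by plotting `r` against `log(mu)`.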

The corresponding plots are given in Fig. 11.6; the difference to Fig. 11.4 is that
the latter considers unit dispersion whereas the former scales the residuals with
the rooted dispersion $\sqrt{\hat\varphi_p(x_i)}$; note that $v_i \equiv 1$ in this example. By scaling with
the rooted dispersion the resulting deviance residuals $r_i^{D}$ should roughly have unit
standard deviation. From Fig. 11.6 we observe that indeed this is the case: the cyan
line shows a spline fit of twice the standard deviation of the deviance residuals $r_i^{D}$.
These splines are of magnitude 2 which verifies the unit standard deviation property.
Moreover, the cyan lines are roughly horizontal which indicates that the dispersion
estimation and the scaling works across all expected claim sizes $\hat\mu_p(x_i)$. The three
different power variance parameters p = 2, 2.5, 3 show different behaviors in the
lower and upper tails in the residuals (centering around the orange horizontal zero
line in Fig. 11.6) which corresponds to the different distributional properties of the
chosen models.

[Figure 11.6: three panels titled ‘Tukey-Anscombe plot: gamma’, ‘Tukey-Anscombe plot: p=2.5’ and ‘Tukey-Anscombe plot: inverse Gaussian’]
Fig. 11.6 Tukey–Anscombe plots giving the dispersion scaled deviance residuals $r_i^{D}$ (11.13) of
the models jointly fitting the mean parameters $\hat\mu_p(x_i)$ and the dispersion parameters $\hat\varphi_p(x_i)$: (lhs)
gamma model, (middle) power variance parameter p = 2.5 model, and (rhs) inverse Gaussian
models; the cyan lines correspond to 2 standard deviations

[Figure 11.7: three panels titled ‘gamma model: fitted model vs. observations’, ‘gamma model: estimated shape parameters’ and ‘inverse Gaussian: fitted model vs. observations’]
Fig. 11.7 (lhs) Gamma model: observations vs. simulations on log-scale, (middle) gamma model:
estimated shape parameters $\hat\alpha_t^\dagger = 1/\hat\varphi_2(x_t^\dagger) < 1$, $1 \le t \le T$, and (rhs) inverse Gaussian model:
observations vs. simulations on log-scale

We further analyze the gamma and the inverse Gaussian models. Note that the
analysis of the power variance models for general power variance parameters p ≠
0, 1, 2, 3 is more difficult because neither the EDF density nor the EDF distribution
function have a closed form. To analyze the gamma and the inverse Gaussian models
we simulate observations $X_t^{\rm sim}$, t = 1, ..., T, from the estimated models (using the
out-of-sample features $x_t^\dagger$ of the test data $\mathcal{T}$), and we compare them against the
true out-of-sample observations $X_t^\dagger$. Figure 11.7 shows the results for the gamma
model (lhs) and the inverse Gaussian model (rhs) on the log-scale. A good fit has
been achieved if the black dots lie on the red diagonal line (in the colored version),
because then the simulated data shares similar features as the observed data. The fit
of the inverse Gaussian model seems reasonably good.

On the other hand, we see that the gamma model gives a poor fit, especially
in the lower tail. This supports the AIC values of Table 11.5. The problem with
the gamma model is that the data is more heavy-tailed than the gamma model can
accommodate. As a consequence, the dispersion parameter estimates $\hat\varphi_2(x_t^\dagger)$ in the
gamma model are compensating for this by taking values bigger than 1. A dispersion
parameter bigger than 1 implies a shape parameter in the gamma model of $\hat\alpha_t^\dagger =
1/\hat\varphi_2(x_t^\dagger) < 1$, and the resulting gamma density is strictly decreasing, see Fig. 2.1. If
we simulate from this model we receive many observations $X_t^{\rm sim}$ close to zero (from
the strictly decreasing density). This can be seen from the lower-left part of the graph
in Fig. 11.7 (lhs), suggesting that we have many observations with $X_t^{\rm sim} \in (0, 1)$, or on
the log-scale $\log(X_t^{\rm sim}) < 0$. However, the graph shows that this is not the case in the
real data. Figure 11.7 (middle) shows the boxplot of the estimated shape parameters
$\hat\alpha_t^\dagger$ on the test data, $1 \le t \le T$, verifying that most insurance policies of the test data
$\mathcal{T}$ receive a shape parameter $\hat\alpha_t^\dagger$ less than 1.

We conclude that the inverse Gaussian double FN network model seems to work
well for this data, and we give preference to this model. ■
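The effect of a gamma shape parameter below 1 on simulated claims can be quantified with the gamma distribution function in base R. The numbers below are a toy illustration: the mean claim size of 1500 and the two shape values are our assumptions and are not estimates from the Swiss accident insurance data.

```r
mu <- 1500   # toy mean claim size (assumption)

# probability that a gamma claim with the given shape and mean mu falls below 1
p_below_1 <- function(shape) pgamma(1, shape = shape, rate = shape / mu)

print(p_below_1(0.5))   # about 2% of the claims fall below 1, i.e. log(X) < 0
print(p_below_1(2))     # essentially zero for a shape parameter above 1

# for shape < 1 the gamma density is strictly decreasing
d <- dgamma(c(1, 10, 100, 1000), shape = 0.5, rate = 0.5 / mu)
print(all(diff(d) < 0))

# simulation cross-check of the first probability
set.seed(6)
X_sim <- rgamma(1e5, shape = 0.5, rate = 0.5 / mu)
print(mean(log(X_sim) < 0))
```

A mass of simulated claims with negative logarithm is the pattern visible in the lower-left part of a simulation-versus-observation plot on the log-scale.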


11.1.4 Pseudo Maximum Likelihood Estimator 


This short section gives a mathematical foundation to parameter estimation under
model uncertainty and model misspecification. We summarize the results of
Gourieroux et al. [168], and we refrain from giving any proofs in this section.
Assume that the real-valued observations $Y_i$, $1 \le i \le n$, have been generated by the
model

$$
Y_i = \mu_{\zeta_0}(x_i) + \varepsilon_i,
\tag{11.14}
$$

with (true) parameter $\zeta_0 \in \Lambda \subset \mathbb{R}^r$, feature $x_i \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$, and where
the conditional distribution of the noise random variables $(\varepsilon_i)_{1 \le i \le n}$ satisfies the
conditional independence property $p_\varepsilon(\varepsilon_1, \ldots, \varepsilon_n | x_1, \ldots, x_n) = \prod_{i=1}^n p_\varepsilon(\varepsilon_i | x_i)$.
Denote by $p_x(x)$ the portfolio distribution of the features x. Thus, under (11.14), the
claim Y of a randomly selected policy is generated by the joint probability measure
$p_{\varepsilon,x}(\varepsilon, x) = p_\varepsilon(\varepsilon | x)\, p_x(x)$. The technical assumptions under which the following
statements hold are given in Assumption 11.9 at the end of this section.

Let $F_0(\cdot | x_i)$ denote the true conditional distribution of $Y_i$, given $x_i$. Typically,
this (true) conditional distribution is unknown. It is assumed to provide the first two
conditional moments

$$
E_{F_0}[Y_i | x_i] = \mu_{\zeta_0}(x_i) \qquad \text{and} \qquad \mathrm{Var}_{F_0}(Y_i | x_i) = \sigma_0^2(x_i).
$$
Thus, $\varepsilon_i | x_i$ is assumed to be centered with conditional variance $\sigma_0^2(x_i)$, see (11.14).
Our goal is to estimate the (true) parameter $\zeta_0 \in \Lambda$, based on the fact that the
conditional distribution $F_0(\cdot | x)$ of the observations is unknown. Throughout we
assume parameter identifiability, i.e., if $\mu_\zeta(x) = \mu_{\zeta_0}(x)$, $p_x$-a.s., then $\zeta = \zeta_0$.
The following estimator is called pseudo maximum likelihood estimator (PMLE)

$$
\hat\zeta^{\rm PMLE} = \underset{\zeta \in \Lambda}{\arg\min}\ \frac{1}{n} \sum_{i=1}^n d\left(Y_i, \mu_\zeta(x_i)\right),
\tag{11.15}
$$

where $d(y, \mu)$ is the unit deviance of a (pre-chosen) single-parameter linear EDF
being parametrized by the same parameter space $\Lambda \subset \mathbb{R}^r$ as the original random
variables (11.14); note that $\Lambda$ is not the effective domain $\Theta$ of the chosen EDF.
$\hat\zeta^{\rm PMLE}$ is called PMLE because it is an MLE for $\zeta_0 \in \Lambda$, but not in the right
model, because the pre-chosen EDF in (11.15) typically differs from the (unknown)
true conditional distribution $F_0(\cdot | x)$. Nevertheless, we may hope to find the true
parameter $\zeta_0$, but possibly at a slower asymptotic rate. This is exactly what is going
to be stated in the next theorems.


Theorem 11.5 (Theorem 1 of Gourieroux et al. [168]) Denote by $\mathcal{M} = \kappa'(\mathring{\Theta})$
the dual mean parameter space of the pre-chosen EDF (having cumulant function
$\kappa$), and assume that $\mu_\zeta(x) \in \mathcal{M}$ for all $x \in \mathcal{X}$ and $\zeta \in \Lambda$. Let Assumption 11.9,
below, hold. The PMLE $\hat\zeta^{\rm PMLE}$ is strongly consistent for $\zeta_0$, i.e., it converges a.s. as
$n \to \infty$.


This theorem tells us that we can perform MLE in a pre-chosen EDF (which may
differ from the true data model), and asymptotically we find the true parameter $\zeta_0$
of the data model $F_0(\cdot | x)$. Of course, this uses the fact that any unit deviance $d$ is
a strictly consistent loss function for mean estimation, see Theorem 4.19. We do
not only receive consistency, but the following theorem also gives us the rate of
convergence.


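Theorem 11.6 (Theorem 3 of Gourieroux et al. [168]) Set the same assumptions
as in Theorem 11.5. The PMLE $\hat\zeta^{\rm PMLE}$ has the following asymptotic behavior

$$
\sqrt{n}\left(\hat\zeta^{\rm PMLE} - \zeta_0\right) \Rightarrow
\mathcal{N}\left(0,\ \mathcal{I}^*(\zeta_0)^{-1}\, \Sigma(\zeta_0)\, \mathcal{I}^*(\zeta_0)^{-1}\right)
\qquad \text{for } n \to \infty,
$$

with the following matrices evaluated in $\zeta = \zeta_0$

$$
\mathcal{I}^*(\zeta) = E_{p_x}\left[\mathcal{I}^*(\zeta; x)\right]
= E_{p_x}\left[J(\zeta; x)^\top \kappa''\left(h(\mu_\zeta(x))\right) J(\zeta; x)\right] \in \mathbb{R}^{r \times r},
$$

$$
\Sigma(\zeta) = E_{p_x}\left[J(\zeta; x)^\top \sigma_0^2(x)\, J(\zeta; x)\right] \in \mathbb{R}^{r \times r},
$$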
where $h = (\kappa')^{-1}$ is the canonical link of the pre-chosen EDF, and with the change
of variable $\zeta \mapsto \theta = \theta(\zeta) = h(\mu_\zeta(x)) \in \Theta$, for given feature x, having Jacobian

$$
J(\zeta; x) = \left(\frac{\partial}{\partial \zeta_k}\, \theta(\zeta)\right)_{1 \le k \le r}
= \frac{1}{\kappa''\left(h(\mu_\zeta(x))\right)} \left(\nabla_\zeta\, \mu_\zeta(x)\right)^\top \in \mathbb{R}^{1 \times r}.
$$

Remark that $\mathcal{I}^*(\zeta)$ averages Fisher’s information $\mathcal{I}^*(\zeta; x)$ (of the chosen EDF)
over the feature distribution $p_x$. This theorem can be seen as a modification of (3.36)
to the regression case. Theorem 11.6 gives us the asymptotic normality of the
PMLE, and the resulting asymptotic variance depends on how well the pre-chosen
EDF matches the true data distribution $F_0(\cdot | x)$. The following lemma corresponds
to Property 5 in Gourieroux et al. [168].


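Lemma 11.7 The asymptotic variance in Theorem 11.6 has the following lower bound (in the sense of positive semi-definite matrices); set
$\zeta = \zeta_0$ and $\sigma^2(x) = \sigma_0^2(x)$,

$$\mathcal{I}^*(\zeta)^{-1}\, \Sigma^*(\zeta)\, \mathcal{I}^*(\zeta)^{-1} ~\ge~ \mathcal{H}(\zeta) = \left(\mathbb{E}_{p_x}\left[\nabla_\zeta \mu_\zeta(x)\; \sigma^{-2}(x) \left(\nabla_\zeta \mu_\zeta(x)\right)^\top\right]\right)^{-1} \in \mathbb{R}^{r \times r}.$$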


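Proof We set $\tau_\zeta^2(x) = \kappa''(h(\mu_\zeta(x)))$. We have $J(\zeta; x)^\top = \nabla_\zeta \mu_\zeta(x)\, \tau_\zeta^{-2}(x)$. The
following matrix is positive semi-definite and it satisfies

$$\mathbb{E}_{p_x}\Big[\left(\mathcal{I}^*(\zeta)^{-1} J(\zeta;x)^\top - \mathcal{H}(\zeta)\, J(\zeta;x)^\top \tau_\zeta^{2}(x)\, \sigma^{-2}(x)\right) \sigma^2(x) \left(\mathcal{I}^*(\zeta)^{-1} J(\zeta;x)^\top - \mathcal{H}(\zeta)\, J(\zeta;x)^\top \tau_\zeta^{2}(x)\, \sigma^{-2}(x)\right)^\top\Big]$$

$$= \mathcal{I}^*(\zeta)^{-1} \Sigma^*(\zeta)\, \mathcal{I}^*(\zeta)^{-1} - \mathcal{H}(\zeta)\, \mathcal{I}^*(\zeta)\, \mathcal{I}^*(\zeta)^{-1} - \mathcal{I}^*(\zeta)^{-1} \mathcal{I}^*(\zeta)\, \mathcal{H}(\zeta) + \mathcal{H}(\zeta)\, \mathcal{H}(\zeta)^{-1} \mathcal{H}(\zeta)$$

$$= \mathcal{I}^*(\zeta)^{-1} \Sigma^*(\zeta)\, \mathcal{I}^*(\zeta)^{-1} - \mathcal{H}(\zeta).$$

This proves the claim. □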


Theorem 11.6 and Lemma 11.7 tell us that if we estimate the parameter $\zeta_0$ of
the unknown model $F_0(\cdot|x)$ with PMLE based on a single-parameter linear EDF,
we receive minimal asymptotic variance if we can match the variance $V(\mu_\zeta(x)) = \kappa''(h(\mu_\zeta(x)))$
of the chosen EDF with the variance $\sigma_0^2(x)$ of the true data model.
E.g., if we know that the variance in the true model behaves as $\sigma_0^2(x) = \mu_{\zeta_0}^3(x)$
we should select the inverse Gaussian model with variance function $V(\mu) = \mu^3$ for
PMLE.
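The statement of Lemma 11.7 can be checked numerically in a one-dimensional toy case. The following is a minimal sketch in plain Python, under assumptions that are not taken from the text: a scalar parameter, a log-link mean $\mu_\zeta(x) = \exp(\zeta x)$, a small uniform feature grid, and a true variance $\sigma_0^2(x) = \mu_{\zeta_0}^3(x)$. It compares the sandwich variance of Theorem 11.6 for power variance functions $V(\mu) = \mu^p$ with the lower bound $\mathcal{H}(\zeta_0)$.

```python
import math

# Hypothetical toy setup (not from the text): scalar parameter zeta,
# mean mu_zeta(x) = exp(zeta * x), features uniform on a small grid.
ZETA0 = 0.5
FEATURES = [0.2, 0.6, 1.0, 1.4, 1.8]


def mu(x, zeta=ZETA0):
    return math.exp(zeta * x)


def grad_mu(x, zeta=ZETA0):
    # derivative of exp(zeta * x) with respect to zeta
    return x * math.exp(zeta * x)


def true_variance(x):
    # assumed true variance structure sigma_0^2(x) = mu(x)^3
    return mu(x) ** 3


def sandwich_variance(p):
    """I*^{-1} Sigma* I*^{-1} for the pre-chosen variance function V(mu) = mu^p."""
    n = len(FEATURES)
    # I* = E[J^2 V] = E[grad^2 / V], since J = grad / V in the scalar case
    info = sum(grad_mu(x) ** 2 / mu(x) ** p for x in FEATURES) / n
    # Sigma* = E[J^2 sigma_0^2] = E[grad^2 sigma_0^2 / V^2]
    sigma = sum(grad_mu(x) ** 2 * true_variance(x) / mu(x) ** (2 * p)
                for x in FEATURES) / n
    return sigma / info ** 2


def lower_bound():
    """H(zeta_0) = (E[grad^2 / sigma_0^2])^{-1}."""
    n = len(FEATURES)
    return 1.0 / (sum(grad_mu(x) ** 2 / true_variance(x) for x in FEATURES) / n)


if __name__ == "__main__":
    for p, name in [(0, "Gaussian"), (1, "Poisson"), (2, "gamma"), (3, "inverse Gaussian")]:
        print(f"{name:17s} p={p}: {sandwich_variance(p):.6f}")
    print(f"lower bound H          : {lower_bound():.6f}")
```

In this toy case only the choice $p = 3$ attains the lower bound, because it is the only power variance function that matches the assumed true variance structure.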

If the members of the single-parameter linear EDF do not fully match the
variance structure of the true data, we can turn our attention to a dispersion submodel
as in Sect. 5.5.1. Assume for the variance structure of the true data

$${\rm Var}_{F_0}(Y_i | x_i) = \sigma_0^2(x_i) = \frac{1}{v_i}\, s_{\alpha_0}^2(x_i),$$

for a regression function $x \mapsto s_{\alpha_0}^2(x)$ involving the (true) regression parameter $\alpha_0$
and exposures $v_i > 0$. If we choose a fixed EDF, we have the log-likelihood function

$$(\mu, \varphi) ~\mapsto~ \ell_Y(\mu, \varphi; v) = \frac{v}{\varphi}\Big[Y h(\mu) - \kappa(h(\mu))\Big] + a(Y; v/\varphi).$$

Equating the variance structure of the true data model with the variance in this
pre-specified EDF, we obtain the feature-dependent dispersion parameter

$$\varphi(x_i) = \frac{s_{\alpha_0}^2(x_i)}{V(\mu_{\zeta_0}(x_i))}, \tag{11.16}$$

with variance function $V(\mu) = (\kappa'' \circ h)(\mu)$. The following theorem proposes a
two-step procedure for this estimation problem.


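Theorem 11.8 (Theorem 4 of Gourieroux et al. [168]) Assume $\tilde{\zeta}_n$ and $\tilde{\alpha}_n$ are
strongly consistent estimators for $\zeta_0$ and $\alpha_0$, as $n \to \infty$, such that $\sqrt{n}(\tilde{\zeta}_n - \zeta_0)$ and
$\sqrt{n}(\tilde{\alpha}_n - \alpha_0)$ are bounded in probability. The quasi-generalized pseudo maximum
likelihood estimator (QPMLE) of $\zeta$ is obtained by

$$\hat{\zeta}_n^{\rm QPMLE} = \underset{\zeta \in \Lambda}{\arg\max}\; \sum_{i=1}^n \ell_{Y_i}\left(\mu_\zeta(x_i),\; \frac{s_{\tilde{\alpha}_n}^2(x_i)}{V(\mu_{\tilde{\zeta}_n}(x_i))};\; v_i\right).$$

Under Assumption 11.9, below, $\hat{\zeta}_n^{\rm QPMLE}$ is strongly consistent and best asymptotically
normal, i.e.,

$$\sqrt{n}\left(\hat{\zeta}_n^{\rm QPMLE} - \zeta_0\right) ~\Rightarrow~ \mathcal{N}\left(0,\, \mathcal{H}(\zeta_0)\right) \qquad \text{for } n \to \infty.$$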


This justifies the approach(es) in the previous chapters and sections, though
not fully, because we neither work with the MLE in FN networks nor do we
care about identifiability in parameters. Nevertheless, this short section suggests
to find strongly consistent estimators $\tilde{\zeta}_n$ and $\tilde{\alpha}_n$ for $\zeta_0$ and $\alpha_0$. This gives us a first
model calibration step that allows us to specify the dispersion structure $x \mapsto \varphi(x)$
via (11.16). Using this dispersion structure and the deviance loss function (4.9) for
a variable dispersion parameter $\varphi(x)$, the QPMLE is obtained in the second step by
replacing the likelihood maximization by the deviance loss minimization,

$$\hat{\zeta}_n^{\rm QPMLE} = \underset{\zeta \in \Lambda}{\arg\min}\; \frac{1}{n} \sum_{i=1}^n \frac{v_i}{s_{\tilde{\alpha}_n}^2(x_i) / V(\mu_{\tilde{\zeta}_n}(x_i))}\; \mathfrak{d}\left(Y_i, \mu_\zeta(x_i)\right).$$

This QPMLE is best asymptotically normal, thus, asymptotically optimal within the
EDF. There might still be better estimators for $\zeta_0$, but these are outside the EDF.
If we turn M-estimation into Z-estimation we have the requirement for $\zeta$, see
also (11.5),

$$\frac{1}{n} \sum_{i=1}^n v_i\, \frac{V(\mu_{\tilde{\zeta}_n}(x_i))}{s_{\tilde{\alpha}_n}^2(x_i)}\; \frac{Y_i - \mu_\zeta(x_i)}{V(\mu_\zeta(x_i))}\; \nabla_\zeta \mu_\zeta(x_i) ~\overset{!}{=}~ 0.$$

Thus, it all boils down to finding the right variance structure to receive the optimal
asymptotic behavior.
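To make the role of the first-step weights concrete, consider the simplest case of a constant mean $\mu_\zeta(x) = \zeta$ (an illustrative assumption, not a model from the text). Then $\nabla_\zeta \mu_\zeta = 1$ and $V(\mu_\zeta)$ is a common factor, so the Z-estimation requirement reduces to $\sum_i w_i (Y_i - \zeta) = 0$ with weights $w_i = v_i / \varphi(x_i)$. A minimal Python sketch of this special case:

```python
def qpmle_intercept(y, v, phi):
    """Solve sum_i (v_i / phi_i) * (y_i - zeta) = 0 for a constant mean zeta.

    y   : observations
    v   : exposures v_i > 0
    phi : first-step dispersion estimates phi(x_i) > 0 (assumed given)
    """
    weights = [vi / ph for vi, ph in zip(v, phi)]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


if __name__ == "__main__":
    y = [1.0, 2.0, 6.0]
    v = [1.0, 1.0, 1.0]
    phi = [1.0, 1.0, 4.0]   # the third observation is assumed four times more dispersed
    print(qpmle_intercept(y, v, phi))              # weighted solution
    print(qpmle_intercept(y, v, [1.0, 1.0, 1.0]))  # plain average for comparison
```

Observations with a larger estimated dispersion receive a smaller weight, which is the mechanism behind the improved asymptotic variance of the second step.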
The previous statements hold true under the following technical assumptions.
These are taken from Appendix 1 of Gourieroux et al. [167], and they are an adapted
version of the ones in Burguete et al. [61].


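Assumption 11.9

(i) $\mu_\zeta(x)$ and $\mathfrak{d}(y, \mu_\zeta(x))$ are continuous w.r.t. all variables and twice continuously
differentiable in $\zeta$;

(ii) $\Lambda \subset \mathbb{R}^r$ is a compact set and the true parameter $\zeta_0$ is in the interior of $\Lambda$;

(iii) almost every realization of $(\varepsilon_i, x_i)$ is a Cesàro sum generator w.r.t. the
probability measure $p_{\varepsilon,x}(\varepsilon, x) = p_\varepsilon(\varepsilon|x)\, p_x(x)$ and to a dominating function
$b(\varepsilon, x)$;

(iv) the sequence $(x_i)_i$ is a Cesàro sum generator w.r.t. $p_x$ and $b(x) = \int_{\mathbb{R}} b(\varepsilon, x)\, dp_\varepsilon(\varepsilon|x)$;

(v) for each $x \in \{1\} \times \mathbb{R}^q$, there exists a neighborhood $N_x \subset \{1\} \times \mathbb{R}^q$ such that

$$\int_{\mathbb{R}}\, \sup_{x' \in N_x} b(\varepsilon, x')\; dp_\varepsilon(\varepsilon|x) < \infty;$$

(vi) the functions $\mathfrak{d}(Y, \mu_\zeta(x))$, $\partial \mathfrak{d}(Y, \mu_\zeta(x))/\partial \zeta_k$, $\partial^2 \mathfrak{d}(Y, \mu_\zeta(x))/\partial \zeta_k \partial \zeta_l$ are dominated
by $b(\varepsilon, x)$.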


11.2 Deep Quantile Regression 


So far, in network regression modeling, we have not addressed the question of 
prediction uncertainty. As mentioned in Remarks 4.2 on forecast evaluation, there 
are different sources that contribute to prediction uncertainty. There is the model 
and parameter estimation uncertainty, which may result in an inappropriate model 
choice, and there is the irreducible risk which comes from the fact that we forecast 
random variables which inherit a natural randomness that cannot be controlled. 

We have discussed methods of evaluating model and parameter estimation error, 
such as the asymptotic normality of MLEs within GLMs, and we have discussed 
forecast dominance, the bootstrap method or the nagging predictor that allow 
one to assess the different sources of prediction uncertainty. However, we have 
not explicitly quantified these sources of uncertainty within the class of network
regression models. We make an attempt in Sect. 11.4, below, by considering the
fluctuations generated by bootstrap simulations. The irreducible risk can be assessed
once we have a suitable statistical model; in Example 11.4 we have studied a
gamma and an inverse Gaussian model on an explicit data set, and these models
can be used, e.g., to calculate quantiles. In this section we consider a distribution-free
approach that directly estimates these quantiles. Recall from Section 5.8.3 that
quantiles are elicitable with the pinball loss as a strictly consistent loss function, see
Theorem 5.33. This allows us to directly estimate the quantiles from the data.


11.2.1 Deep Quantile Regression: Single Quantile 


In this section we present a way of assessing the irreducible risk which does not
require a sophisticated model evaluation of distributional assumptions. Quantile
regression is increasingly used in the machine learning community because it is
a robust way of quantifying the irreducible risk; we refer to Meinshausen [270],
Takeuchi et al. [350] and Richman [314]. We recall that quantiles are elicitable
having the pinball loss as a strictly consistent loss function, see Theorem 5.33.
We define a FN network regression model that allows us to directly estimate the
quantiles based on the pinball loss. We therefore use an adapted version of the
R code of Listing 9 in Richman [314]; this adapted version has been proposed in
Fissler et al. [130] to ensure that different quantiles respect monotonicity. For any
two quantile levels $0 < \tau_1 < \tau_2 < 1$ we have

$$F^{-1}(\tau_1) \le F^{-1}(\tau_2), \tag{11.17}$$

where $F^{-1}$ denotes the generalized inverse of distribution function $F$, see (5.80).
If we simultaneously learn these quantiles for different quantile levels $\tau_1 < \tau_2$,
we need to force the network to respect this monotonicity (11.17). This can be
achieved by a special network architecture in the output layer, which is
presented in the next section.

We start by considering a single deep $\tau$-quantile regression for a quantile level
$\tau \in (0, 1)$. For datum $(Y, x)$ we consider the regression function

$$x ~\mapsto~ g\left(F_{Y|x}^{-1}(\tau)\right) = \left\langle \beta_\tau, z^{(d:1)}(x) \right\rangle, \tag{11.18}$$

for a strictly monotone and smooth link function $g$, output parameter $\beta_\tau \in \mathbb{R}^{q_d+1}$,
and where $x \mapsto z^{(d:1)}(x)$ is a deep network. We add a lower index $Y|x$ to the
generalized inverse $F_{Y|x}^{-1}$ to highlight that we consider the conditional distribution
of $Y$, given feature $x \in \mathcal{X}$. In the case of a deep FN network, (11.18) involves
a network parameter $\vartheta = (w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta_\tau)^\top$ that needs to be estimated. Of
course, the deep network architecture $x \mapsto z^{(d:1)}(x)$ could also involve any other
feature, such as CN or LSTM layers, embedding layers or a NLP text recognition
feature. This would change the network architecture, but it would not change
anything from a methodological viewpoint.

To estimate this regression parameter $\vartheta$ from independent data $(Y_i, x_i)$, $1 \le i \le n$,
we consider the objective function

$$\vartheta ~\mapsto~ \sum_{i=1}^n L_\tau\left(Y_i,\; g^{-1}\left\langle \beta_\tau, z^{(d:1)}(x_i) \right\rangle\right),$$

with the strictly consistent pinball loss function $L_\tau$ for the $\tau$-quantile. Alternatively,
we could choose any other loss function satisfying Theorem 5.33, and we may try
to find the asymptotically optimal one (similarly to Theorem 11.8). We refrain from
doing so, but we mention Komunjer–Vuong [222]. Fitting the network parameter
$\vartheta$ is then done in complete analogy to finding an optimal network parameter for
network mean modeling. The only change is that we replace the deviance loss
function by the pinball loss, e.g., in Listing 7.3 we have to exchange the loss function
on line 5 correspondingly.
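To illustrate why minimizing the pinball loss estimates a quantile, the following minimal Python sketch (plain Python, independent of the keras code of the listings) fits a constant action to a small made-up sample. The helper names and the sample are illustrative assumptions. The pinball loss is written in the form $\tau (y-a)_+ + (1-\tau)(a-y)_+$.

```python
def pinball_loss(y, a, tau):
    """Pinball loss tau * (y - a)_+ + (1 - tau) * (a - y)_+ for observation y and action a."""
    return tau * max(y - a, 0.0) + (1.0 - tau) * max(a - y, 0.0)


def average_pinball(sample, a, tau):
    return sum(pinball_loss(y, a, tau) for y in sample) / len(sample)


def best_constant(sample, tau):
    """Minimize the average pinball loss over the observed values.

    The average loss is piecewise linear in a, so a minimizer always
    lies among the observations.
    """
    return min(sorted(sample), key=lambda a: average_pinball(sample, a, tau))


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 100.0]
    print(best_constant(sample, 0.5))   # empirical median
    print(best_constant(sample, 0.9))   # empirical 90% quantile
```

The outlier 100 does not move the fitted median, which reflects the robustness of quantile estimation mentioned above. In a network regression the constant action is replaced by the network output $g^{-1}\langle \beta_\tau, z^{(d:1)}(x_i)\rangle$.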


11.2.2 Deep Quantile Regression: Multiple Quantiles 


We now turn our attention to the multiple quantile case that should satisfy the
monotonicity requirement (11.17) for any quantile levels $0 < \tau_1 < \tau_2 < 1$.
A separate deep quantile estimation for both quantile levels, as described in the
previous section, may violate the monotonicity property, at least in some part of
the feature space $\mathcal{X}$, especially if the two quantile levels are close. Therefore, we
enforce the monotonicity by a special choice of the network architecture.

For simplicity, in the remainder of this section, we assume that the response $Y$ is
positive, a.s. This implies for the quantiles $\tau \mapsto F_{Y|x}^{-1}(\tau) \ge 0$, and we should choose
a link function with $g^{-1} \ge 0$ in (11.18). To ensure the monotonicity (11.17) for the
quantile levels $0 < \tau_1 < \tau_2 < 1$, we choose a second positive link function with
$g_+^{-1} \ge 0$, and we set for multi-task forecasting

$$x ~\mapsto~ \left(F_{Y|x}^{-1}(\tau_1),\, F_{Y|x}^{-1}(\tau_2)\right)^\top = \left(g^{-1}\left\langle \beta_{\tau_1}, z^{(d:1)}(x)\right\rangle,\;\; g^{-1}\left\langle \beta_{\tau_1}, z^{(d:1)}(x)\right\rangle + g_+^{-1}\left\langle \beta_{\tau_2}, z^{(d:1)}(x)\right\rangle\right)^\top \in \mathbb{R}_+^2, \tag{11.19}$$

for a regression parameter $\vartheta = (w_1^{(1)}, \ldots, w_{q_d}^{(d)}, \beta_{\tau_1}, \beta_{\tau_2})^\top$. The positivity $g_+^{-1} \ge 0$
enforces the monotonicity in the two quantiles. We call (11.19) an additive approach
as we start from a base level characterized by the smaller quantile $F_{Y|x}^{-1}(\tau_1)$, and any
bigger quantile is modeled by an additive increment. To ensure monotonicity for
multiple quantiles we proceed recursively by choosing the lowest quantile as the
initial base level.

We can also consider the upper quantile as the base level by multiplicatively
lowering this upper quantile. Choose the (sigmoid) function $g_\sigma^{-1} \in (0, 1)$ and set
for the multiplicative approach

$$x ~\mapsto~ \left(F_{Y|x}^{-1}(\tau_1),\, F_{Y|x}^{-1}(\tau_2)\right)^\top = \left(g_\sigma^{-1}\left\langle \beta_{\tau_1}, z^{(d:1)}(x)\right\rangle\, g^{-1}\left\langle \beta_{\tau_2}, z^{(d:1)}(x)\right\rangle,\;\; g^{-1}\left\langle \beta_{\tau_2}, z^{(d:1)}(x)\right\rangle\right)^\top \in \mathbb{R}_+^2. \tag{11.20}$$
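The two constructions can be mimicked outside a network. In the following Python sketch the real-valued "scores" stand in for the scalar products $\langle \beta_{\tau_j}, z^{(d:1)}(x)\rangle$ at one fixed feature $x$; the function names and the exponential and sigmoid choices for $g^{-1}$, $g_+^{-1}$ and $g_\sigma^{-1}$ are illustrative assumptions, chosen to match the output activations used in the listings of the lab section.

```python
import math


def additive_quantiles(base_score, increment_scores):
    """Additive approach: lowest quantile exp(base_score), then add positive increments exp(s)."""
    quantiles = [math.exp(base_score)]
    for s in increment_scores:
        quantiles.append(quantiles[-1] + math.exp(s))
    return quantiles


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def multiplicative_quantiles(top_score, factor_scores):
    """Multiplicative approach: highest quantile exp(top_score), then shrink by factors in (0, 1)."""
    quantiles = [math.exp(top_score)]
    for s in factor_scores:
        quantiles.append(quantiles[-1] * sigmoid(s))
    return quantiles[::-1]   # returned in increasing order of the quantile level


if __name__ == "__main__":
    # Arbitrary scores: the ordering holds whatever values the network produces.
    print(additive_quantiles(0.3, [-2.0, 5.0, -0.7, 1.1]))
    print(multiplicative_quantiles(2.0, [3.0, -4.0, 0.0, 1.5]))
```

Whatever the scores are, both outputs are increasing in the quantile level, so the monotonicity (11.17) holds by construction rather than by training.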


Remark 11.10 In (11.19) and (11.20) we directly enforce the monotonicity by a
corresponding regression function choice. Alternatively, we can also design a
(plain-vanilla) multi-output network

$$x ~\mapsto~ \left(F_{Y|x}^{-1}(\tau_1),\, F_{Y|x}^{-1}(\tau_2)\right)^\top = \left(g^{-1}\left\langle \beta_{\tau_1}, z^{(d:1)}(x)\right\rangle,\;\; g^{-1}\left\langle \beta_{\tau_2}, z^{(d:1)}(x)\right\rangle\right)^\top \in \mathbb{R}_+^2. \tag{11.21}$$

If we just use a classical SGD fitting algorithm, we will likely end up in a situation
where the monotonicity is violated in some part of the feature space. Kellner
et al. [211] consider this problem. They add a penalization (regularization term) that
punishes, during SGD training, network parameters that violate the monotonicity.
Such a penalization can be constructed, e.g., with the ReLU function.
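A generic penalty of this kind can be sketched as follows in plain Python. This is only an illustration of the ReLU idea and not necessarily the exact regularizer of Kellner et al. [211]; the function names and the weight parameter are assumptions.

```python
def relu(u):
    return max(u, 0.0)


def crossing_penalty(predictions, weight=1.0):
    """Average ReLU penalty on quantile crossings.

    predictions : list of pairs (q_low, q_high), the predicted quantiles for
                  levels tau_1 < tau_2 at each feature
    The penalty is zero exactly when q_low <= q_high for every pair.
    """
    total = 0.0
    for q_low, q_high in predictions:
        total += relu(q_low - q_high)
    return weight * total / len(predictions)


if __name__ == "__main__":
    print(crossing_penalty([(1.0, 2.0), (3.0, 2.5)]))   # second pair crosses
    print(crossing_penalty([(1.0, 2.0), (2.0, 2.5)]))   # no crossing
```

In training, such a term would be added to the sum of the pinball losses, so that parameters producing crossing quantiles incur an extra cost.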


11.2.3 Lab: Deep Quantile Regression 


We revisit the Swiss accident insurance data of Sect. 11.1.2, and we provide an
example of a deep quantile regression using both the additive approach (11.19) and
the multiplicative approach (11.20).

We select 5 different quantile levels $\mathcal{Q} = (\tau_1, \tau_2, \tau_3, \tau_4, \tau_5) = (10\%, 25\%, 50\%, 75\%, 90\%)$.
We start with the additive approach (11.19). It requires us to set $\tau_1 = 10\%$
as the base level, and the remaining quantile levels are modeled additively in
a recursive way for $\tau_j < \tau_{j+1}$, $1 \le j \le 4$. The corresponding R code is given on
lines 8–20 of Listing 11.3, and this compiles to the 5-dimensional output on line 22.
For the multiplicative approach (11.20) we set $\tau_5 = 90\%$ as the base level, and the
remaining quantile levels are received multiplicatively in a recursive way for $\tau_{j+1} > \tau_j$,
$4 \ge j \ge 1$, see Listing 11.4. The additive and the multiplicative approaches take
the extreme quantiles as initialization. One may also be interested in initializing the
model in the median $\tau_3 = 50\%$; the smaller quantiles can then be received by the
multiplicative approach and the bigger quantiles by the additive approach. We also
explore this case and we call it the mixed approach.


Listing 11.3 Multiple FN quantile regression: additive approach

    Design = layer_input(shape = c(q0), dtype = 'float32', name = 'Design')
    #
    Network = Design %>%
        layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
        layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
        layer_dense(units=10, activation='tanh', name='FNLayer3')
    #
    q1 = Network %>% layer_dense(units=1, activation='exponential')
    #
    q20 = Network %>% layer_dense(units=1, activation='exponential')
    q2 = list(q1,q20) %>% layer_add()
    #
    q30 = Network %>% layer_dense(units=1, activation='exponential')
    q3 = list(q2,q30) %>% layer_add()
    #
    q40 = Network %>% layer_dense(units=1, activation='exponential')
    q4 = list(q3,q40) %>% layer_add()
    #
    q50 = Network %>% layer_dense(units=1, activation='exponential')
    q5 = list(q4,q50) %>% layer_add()
    #
    model = keras_model(inputs = list(Design), outputs = c(q1,q2,q3,q4,q5))


Listing 11.4 Multiple FN quantile regression: multiplicative approach

    q5 = Network %>% layer_dense(units=1, activation='exponential')
    #
    q40 = Network %>% layer_dense(units=1, activation='sigmoid')
    q4 = list(q5,q40) %>% layer_multiply()
    #
    q30 = Network %>% layer_dense(units=1, activation='sigmoid')
    q3 = list(q4,q30) %>% layer_multiply()
    #
    q20 = Network %>% layer_dense(units=1, activation='sigmoid')
    q2 = list(q3,q20) %>% layer_multiply()
    #
    q10 = Network %>% layer_dense(units=1, activation='sigmoid')
    q1 = list(q2,q10) %>% layer_multiply()


Listing 11.5 Fitting a multiple FN quantile regression

    Q_loss1 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.1
                                            + k_maximum(y_pred - y_true, 0) * (1 - 0.1))}
    Q_loss2 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.25
                                            + k_maximum(y_pred - y_true, 0) * (1 - 0.25))}
    Q_loss3 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.5
                                            + k_maximum(y_pred - y_true, 0) * (1 - 0.5))}
    Q_loss4 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.75
                                            + k_maximum(y_pred - y_true, 0) * (1 - 0.75))}
    Q_loss5 = function(y_true, y_pred){k_mean(k_maximum(y_true - y_pred, 0) * 0.9
                                            + k_maximum(y_pred - y_true, 0) * (1 - 0.9))}
    #
    model %>% compile(loss = list(Q_loss1,Q_loss2,Q_loss3,Q_loss4,Q_loss5),
                      optimizer = 'nadam')

These network architectures are fitted to the data using the pinball loss (5.81) for the
quantile levels of $\mathcal{Q}$; note that the pinball loss requires the assumption of having a
finite first moment. Listing 11.5 shows the choice of the pinball loss functions. We
then fit the three architectures (additive, multiplicative and mixed) to our learning
data $\mathcal{L}$, and we apply early stopping to prevent over-fitting. Moreover, we
consider the nagging predictor over 20 runs with different seeds to reduce the
randomness coming from SGD fitting.

In Table 11.6 we give the out-of-sample pinball losses on the test data $\mathcal{T}$ of the three
considered approaches for the 5 quantile levels of $\mathcal{Q}$. The losses of the
three approaches are rather close, giving a slight preference to the mixed approach,
but the other two approaches seem to be competitive, too. We further analyze these
quantile regression models by considering the empirical coverage ratios defined by

$$\hat{\tau}_j = \frac{1}{T} \sum_{t=1}^T \mathbb{1}_{\left\{Y_t^\dagger \,\le\, \hat{F}_{Y|x_t^\dagger}^{-1}(\tau_j)\right\}}, \tag{11.22}$$

where $\hat{F}_{Y|x_t^\dagger}^{-1}(\tau_j)$ is the estimated quantile for level $\tau_j$ and feature $x_t^\dagger$. Remark that the
coverage ratios (11.22) correspond to the identification functions that are essentially
the derivatives of the pinball losses; we refer to Dimitriadis et al. [106]. Table 11.7
reports these out-of-sample coverage ratios on the test data $\mathcal{T}$. From these results
we conclude that on the portfolio level the quantiles are matched rather well.
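The coverage ratio (11.22) is a simple counting exercise. A minimal Python sketch, with made-up numbers and a hypothetical function name:

```python
def coverage_ratio(observations, quantile_estimates):
    """Share of out-of-sample observations lying at or below their estimated quantile."""
    hits = sum(1 for y, q in zip(observations, quantile_estimates) if y <= q)
    return hits / len(observations)


if __name__ == "__main__":
    y_test = [1.0, 5.0, 2.0, 8.0]   # illustrative observations
    q_hat = [2.0, 4.0, 2.0, 9.0]    # illustrative feature-dependent quantile estimates
    print(coverage_ratio(y_test, q_hat))
```

For a well-calibrated $\tau_j$-quantile model this ratio should be close to $\tau_j$ on a large test sample. Note that it is a portfolio-level check: it does not reveal whether the quantiles are matched for each individual feature value.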

In Fig. 11.8 we illustrate the estimated out-of-sample quantiles $\hat{F}_{Y|x_t^\dagger}^{-1}(\tau_j)$ for
individual claims on the quantile levels $\tau_j \in \{10\%, 25\%, 50\%, 75\%, 90\%\}$ (cyan,
blue, black, blue, cyan colors) using the mixed approach. The x-axis considers
the logged estimated medians $\hat{F}_{Y|x_t^\dagger}^{-1}(50\%)$. We observe heteroskedasticity resulting
in quantiles that are not ordered w.r.t. the median (black line). This supports the
multiple deep quantile regression model because we cannot (simply) extrapolate the
median to receive the other quantiles.

In the final step we compare the estimated quantiles $\hat{F}_{Y|x}^{-1}(\tau_j)$ from the mixed deep
quantile regression approach to the ones that can be calculated from the fitted
inverse Gaussian model using the double FN network approach of Example 11.4.
In the latter model we estimate the mean $\hat{\mu}(x)$ and the dispersion $\hat{\varphi}(x)$ with two
FN networks, which then allow us to calculate the quantiles using the inverse
Gaussian distributional assumption. Note that we cannot calculate the quantiles
in Tweedie's family with power variance parameter $p = 2.5$ because there is no

Table 11.6 Out-of-sample pinball losses of quantile regressions using the additive, the
multiplicative and the mixed approaches; nagging predictors over 20 different seeds

                               Out-of-sample losses on T
                               10%       25%       50%       75%       90%
    Additive approach          171.20    412.78    765.60    988.78    936.31
    Multiplicative approach    171.18    412.87    766.04    988.59    936.57
    Mixed approach             171.15    412.55    764.60    988.15    935.50

Table 11.7 Out-of-sample coverage ratios $\hat{\tau}_j$ below the estimated deep FN quantile
estimates $\hat{F}_{Y|x_t^\dagger}^{-1}(\tau_j)$

                               Out-of-sample coverage ratios
                               10%       25%       50%       75%       90%
    Additive approach          10.27%    25.30%    50.19%    75.08%    90.03%
    Multiplicative approach    10.18%    25.15%    49.64%    75.14%    90.22%
    Mixed approach             10.13%    25.03%    50.32%    75.20%    90.08%

Fig. 11.8 Estimated out-of-sample quantiles $\hat{F}_{Y|x_t^\dagger}^{-1}(\tau_j)$ of 2'000 randomly
selected individual claims on the quantile levels $\tau_j \in \{10\%, 25\%, 50\%, 75\%, 90\%\}$
(cyan, blue, black, blue, cyan colors) using the mixed approach; the red dots are the
out-of-sample observations $Y_t^\dagger$; the x-axis gives $\log \hat{F}_{Y|x_t^\dagger}^{-1}(50\%)$ (also
corresponding to the black diagonal line)
[Figure: "Estimated quantiles on individual claims"; x-axis: logged estimated median; y-axis: claims on log-scale; legend: observation, median, 25%/75% quantile, 10%/90% quantile]


closed form of the distribution function. Figure 11.9 compares the two approaches
on the quantile levels of $\mathcal{Q}$. Overall we observe a reasonably good match though it is
not perfect. The small quantiles for level $\tau_1 = 10\%$ seem slightly under-estimated
by the inverse Gaussian approach (see Fig. 11.9 (top-left)), whereas big quantiles
$\tau_4 = 75\%$ and $\tau_5 = 90\%$ seem more conservative in the inverse Gaussian approach
(see Fig. 11.9 (bottom)). This may indicate that the inverse Gaussian distribution
does not fully fit the data, i.e., that one cannot fully recover the true quantiles
from the mean $\hat{\mu}(x)$, the dispersion $\hat{\varphi}(x)$ and an inverse Gaussian assumption.
There are two ways to further explore these issues. One can either choose other
distributional assumptions which may better match the properties of the data; this
further explores the distributional approach. Alternatively, Theorem 5.33 allows us
to choose loss functions different from the pinball loss, i.e., one could consider
different increasing functions $G$ in that theorem to further explore the distribution-free
approach. In general, any increasing choice of the function $G$ leads to a strictly
consistent quantile estimation (this is an asymptotic statement), but these choices
may have different finite sample properties. Following Komunjer–Vuong [222], we
can determine asymptotically efficient choices for $G$. This would require feature
dependent choices $G_{x_i}(y) = F_{Y|x_i}(y)$, where $F_{Y|x_i}$ is the (true) distribution of
$Y_i$, conditionally given $x_i$. This requires the knowledge of the true distribution,
and Komunjer–Vuong [222] derive asymptotic efficiency when replacing this true
distribution by a non-parametric estimator; this is in spirit similar to Theorem 11.8.
We refrain from giving more details but refer to the corresponding paper.

Fig. 11.9 Inverse Gaussian quantiles vs. deep quantile regression estimates of 2'000 randomly
selected claims on the quantile levels of $\mathcal{Q} = (10\%, 25\%, 50\%, 75\%, 90\%)$


11.3 Deep Composite Model Regression 


We have established a deep quantile regression in the previous section. Next we 
jointly estimate quantiles and conditional tail expectations (CTEs), leading to a 
composite regression model that has a splicing point determined by a quantile level; 
for composite models we refer to Sect. 6.4.4. This is exactly the proposal of Fissler et 
al. [130] which we are going to present in this section. Note that having a composite 
model allows us to have different distributions and regression structures below and 
above the splicing point, e.g., we can have a more heavy-tailed model in the upper 
tail using a different feature engineering from the main body of the data. 


11.3.1 Joint Elicitability of Quantiles and Expected Shortfalls 


In the previous examples we have seen that the distributional models may misestimate
the true tail of the data because model fitting often pays more attention to an
accurate model fit in the main body of the data. An idea is to directly estimate this
tail in a distribution-free way by considering the (upper) CTE

$${\rm CTE}_\tau^+(Y|x) = \mathbb{E}\left[\, Y \,\middle|\, Y > F_{Y|x}^{-1}(\tau),\, x \right], \tag{11.23}$$

for a given quantile level $\tau \in (0, 1)$. The problem with (11.23) is that this is not an
elicitable quantity, i.e., there is no loss/scoring function that is strictly consistent for
the CTE functional.

If the distribution function $F_{Y|x}$ is continuous, we can rewrite the upper CTE as
follows, see Lemma 2.16 in McNeil et al. [268] and (11.35) below,

$${\rm CTE}_\tau^+(Y|x) = {\rm ES}_\tau^+(Y|x) = \frac{1}{1-\tau} \int_\tau^1 F_{Y|x}^{-1}(p)\, dp ~\ge~ F_{Y|x}^{-1}(\tau). \tag{11.24}$$

This second object ${\rm ES}_\tau^+(Y|x)$ is called the upper expected shortfall (ES) of $Y$, given
$x$, on the security level $\tau$. Fissler–Ziegel [131] and Fissler et al. [132] have proved
that ${\rm ES}_\tau^+(Y|x)$ is jointly elicitable with the $\tau$-quantile $F_{Y|x}^{-1}(\tau)$. That is, there is a
strictly consistent bivariate loss function that allows one to jointly estimate the
$\tau$-quantile and the corresponding ES. In fact, Corollary 5.5 of Fissler–Ziegel [131]
gives the full characterization of the strictly consistent bivariate loss functions for
the joint elicitability of the $\tau$-quantile and the ES; note that Fissler–Ziegel [131]
use a different sign convention. This result is used in Guillén et al. [175] for the
joint estimation of the quantile and the ES within a GLM. Guillén et al. [175] use a
two-step approach to fit the quantile and the ES.

Fissler et al. [130] extend the results of Fissler–Ziegel [131], allowing for the
joint estimation of the composite triplet consisting of the lower ES, the $\tau$-quantile
and the upper ES. This gives us a composite model that has the $\tau$-quantile as splicing
point. The beauty of this approach is that we can fit (in one step) a deep learning
model to the upper and the lower ES, and perform a (potentially different) regression
in both parts of the distribution. The lower CTE and the lower ES are defined by,
respectively,

$${\rm CTE}_\tau^-(Y|x) = \mathbb{E}\left[\, Y \,\middle|\, Y \le F_{Y|x}^{-1}(\tau),\, x \right],$$

and

$${\rm ES}_\tau^-(Y|x) = \frac{1}{\tau} \int_0^\tau F_{Y|x}^{-1}(p)\, dp ~\le~ F_{Y|x}^{-1}(\tau).$$

Again, in case of a continuous distribution function $F_{Y|x}$ we have the following
identity ${\rm CTE}_\tau^-(Y|x) = {\rm ES}_\tau^-(Y|x)$. From the lower and upper CTEs we receive the
mean of $Y$, given $x$, by

$$\mu(x) = \mathbb{E}[Y|x] = \tau\, {\rm CTE}_\tau^-(Y|x) + (1-\tau)\, {\rm CTE}_\tau^+(Y|x). \tag{11.25}$$
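The decomposition (11.25) can be verified on an empirical distribution. The Python sketch below uses a made-up sample and hypothetical helper names; the sample and the level are chosen such that exactly a fraction $\tau$ of the observations lies at or below the empirical $\tau$-quantile, which mimics the continuous case in which $P[Y \le F_{Y|x}^{-1}(\tau)] = \tau$.

```python
def empirical_quantile(sample, tau):
    """Generalized inverse of the empirical distribution: smallest y with F_n(y) >= tau."""
    ordered = sorted(sample)
    n = len(ordered)
    for k, y in enumerate(ordered, start=1):
        if k / n >= tau:
            return y
    return ordered[-1]


def lower_cte(sample, tau):
    """Average of the observations at or below the empirical tau-quantile."""
    q = empirical_quantile(sample, tau)
    below = [y for y in sample if y <= q]
    return sum(below) / len(below)


def upper_cte(sample, tau):
    """Average of the observations strictly above the empirical tau-quantile."""
    q = empirical_quantile(sample, tau)
    above = [y for y in sample if y > q]
    return sum(above) / len(above)


if __name__ == "__main__":
    sample = [float(k) for k in range(1, 11)]   # 1, 2, ..., 10
    tau = 0.8
    mean = sum(sample) / len(sample)
    print(empirical_quantile(sample, tau), lower_cte(sample, tau), upper_cte(sample, tau))
    print(tau * lower_cte(sample, tau) + (1.0 - tau) * upper_cte(sample, tau), mean)
```

If the share of observations at or below the quantile differs from $\tau$ (which can happen for distributions with point masses), the weights in (11.25) no longer match the two conditional averages, and this is why the CTE and ES representations are only identified for continuous distributions.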


We introduce the auxiliary scoring functions

$$S_\tau^-(y, a) = \left(\mathbb{1}_{\{y \le a\}} - \tau\right) a - \mathbb{1}_{\{y \le a\}}\, y,$$
$$S_\tau^+(y, a) = \left(1 - \tau - \mathbb{1}_{\{y > a\}}\right) a + \mathbb{1}_{\{y > a\}}\, y = S_\tau^-(y, a) + y,$$

for $y, a \in \mathbb{R}$ and for $\tau \in (0, 1)$. These auxiliary functions consider only the part
of the pinball loss (5.81) that depends on action $a$, and we get the pinball loss as
follows

$$L_\tau(y, a) = S_\tau^-(y, a) + \tau y = S_\tau^+(y, a) - (1-\tau)\, y.$$

Therefore, all three functions provide strictly consistent scoring functions for the
$\tau$-quantile, but only the pinball loss satisfies the calibration property (L0) on page
92.
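The relation between the pinball loss and the two auxiliary scoring functions can be checked numerically. A minimal Python sketch (the function names are illustrative, and the pinball loss is written as $(\mathbb{1}_{\{y \le a\}} - \tau)(a - y)$):

```python
def pinball(y, a, tau):
    """Pinball loss (1{y <= a} - tau) * (a - y)."""
    return ((1.0 if y <= a else 0.0) - tau) * (a - y)


def s_minus(y, a, tau):
    """Auxiliary score S^-: (1{y <= a} - tau) * a - 1{y <= a} * y."""
    ind = 1.0 if y <= a else 0.0
    return (ind - tau) * a - ind * y


def s_plus(y, a, tau):
    """Auxiliary score S^+: (1 - tau - 1{y > a}) * a + 1{y > a} * y."""
    ind = 1.0 if y > a else 0.0
    return (1.0 - tau - ind) * a + ind * y


if __name__ == "__main__":
    tau = 0.9
    for y in [-2.0, 0.0, 1.5, 4.0]:
        for a in [-1.0, 0.0, 1.5, 3.0]:
            assert abs(s_plus(y, a, tau) - (s_minus(y, a, tau) + y)) < 1e-12
            assert abs(pinball(y, a, tau) - (s_minus(y, a, tau) + tau * y)) < 1e-12
            assert abs(pinball(y, a, tau) - (s_plus(y, a, tau) - (1.0 - tau) * y)) < 1e-12
    print("identities verified on the grid")
```

The three functions differ only by terms that do not depend on the action $a$, so they share the same minimizers. Only the pinball loss is non-negative and vanishes at $a = y$; for instance, `s_minus(1.0, 1.0, 0.9)` is negative.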

For the following theorem we recall the general definition of the $\tau$-quantile
$Q_\tau(F_{Y|x})$ of a distribution function $F_{Y|x}$, see (5.82).


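Theorem 11.11 (Theorem 2.8 of Fissler et al. [130], Without Proof) Choose $\tau \in (0, 1)$
and let $\mathcal{F}$ contain only distributions with a finite first moment, and being
supported in the interval $\mathsf{C} \subseteq \mathbb{R}$. The loss function $L : \mathsf{C} \times \mathsf{C}^3 \to \mathbb{R}_+$ of the form

$$\begin{aligned} L(y; e^-, q, e^+) ~=~& \left(G(y) - G(q)\right)\left(\tau - \mathbb{1}_{\{y \le q\}}\right) \\ &+ \left\langle \nabla\Psi(e^-, e^+),\, \left(e^- + \frac{1}{\tau} S_\tau^-(y, q),\;\; e^+ - \frac{1}{1-\tau} S_\tau^+(y, q)\right)^\top \right\rangle - \Psi(e^-, e^+) + \Psi(y, y), \end{aligned} \tag{11.26}$$

is strictly consistent for the composite triplet $({\rm ES}_\tau^-, Q_\tau, {\rm ES}_\tau^+)$ relative to the class
$\mathcal{F}$, if $\Psi$ is strictly convex with (sub-)gradient $\nabla\Psi$ such that for all $(e^-, e^+) \in \mathsf{C}^2$
the function

$$q ~\mapsto~ G_{e^-, e^+}(q) = G(q) + \frac{1}{\tau} \frac{\partial}{\partial e^-} \Psi(e^-, e^+)\, q - \frac{1}{1-\tau} \frac{\partial}{\partial e^+} \Psi(e^-, e^+)\, q, \tag{11.27}$$

is strictly increasing, and if $\mathbb{E}_F[|G(Y)|] < \infty$, $\mathbb{E}_F[|\Psi(Y, Y)|] < \infty$ for all $Y \sim F \in \mathcal{F}$.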


This opens the door for regression modeling of CTEs for continuous distribution
functions $F_{Y|x}$, $x \in \mathcal{X}$. Namely, we can choose a regression function $\xi_\vartheta$ with a
three-dimensional output

$$x \in \mathcal{X} ~\mapsto~ \xi_\vartheta(x) \in \mathsf{C}^3,$$

depending on a regression parameter $\vartheta$. This regression function is now used to
describe the composite triplet $({\rm ES}_\tau^-(Y|x), F_{Y|x}^{-1}(\tau), {\rm ES}_\tau^+(Y|x))$. Having i.i.d. data
$(Y_i, x_i)$, $1 \le i \le n$, it can be fitted by solving

$$\hat{\vartheta} = \underset{\vartheta}{\arg\min}\; \frac{1}{n} \sum_{i=1}^n L\left(Y_i;\, \xi_\vartheta(x_i)\right), \tag{11.28}$$

with loss function $L$ given by (11.26). This then provides us with the estimates for
the composite triplet

$$x ~\mapsto~ \xi_{\hat{\vartheta}}(x) = \left(\widehat{\rm ES}_\tau^-(Y|x),\; \hat{F}_{Y|x}^{-1}(\tau),\; \widehat{\rm ES}_\tau^+(Y|x)\right).$$

There remains the choice of the functions $G$ and $\Psi$, such that $\Psi$ is strictly convex
and $G_{e^-, e^+}$, defined in (11.27), is strictly increasing. Section 2.3 in Fissler et
al. [130] discusses possible choices. A simple choice is to select the identity function
$G(y) = y$ (which gives the pinball loss on the first line of (11.26)) and

$$\Psi(e^-, e^+) = \psi_1(e^-) + \psi_2(e^+),$$

with $\psi_1$ and $\psi_2$ strictly convex and with (sub-)gradients $\psi_1' > 0$ and $\psi_2' < 0$.
Inserting this choice into (11.26) provides the loss function

$$L(y; e^-, q, e^+) = \left[1 + \frac{\psi_1'(e^-)}{\tau} + \frac{-\psi_2'(e^+)}{1-\tau}\right] L_\tau(y, q) + D_{\psi_1}(y, e^-) + D_{\psi_2}(y, e^+), \tag{11.29}$$

where $L_\tau(y, q)$ is the pinball loss (5.81) and $D_{\psi_1}$ and $D_{\psi_2}$ are Bregman divergences (2.28).
There remain the choices of $\psi_1$ and $\psi_2$ which should be strictly
convex, the first one being strictly increasing and the second one being strictly
decreasing.

We restrict ourselves to strictly convex functions $\psi$ on the positive real line $\mathbb{R}_+$,
i.e., for positive claims $Y > 0$, a.s. For $b \in \mathbb{R}$, we consider the following functions
on $\mathbb{R}_+$

$$\psi^{(b)}(y) = \begin{cases} \frac{1}{b(b-1)}\, y^b & \text{for } b \neq 0 \text{ and } b \neq 1, \\ -1 - \log(y) & \text{for } b = 0, \\ y \log(y) - y & \text{for } b = 1. \end{cases} \tag{11.30}$$

We compute the first and second derivatives. These are for $y > 0$ given by

$$\frac{\partial}{\partial y} \psi^{(b)}(y) = \begin{cases} \frac{1}{b-1}\, y^{b-1} & \text{for } b \neq 1, \\ \log(y) & \text{for } b = 1, \end{cases} \qquad \text{and} \qquad \frac{\partial^2}{\partial y^2} \psi^{(b)}(y) = y^{b-2} > 0.$$

Thus, for any $b \in \mathbb{R}$ we have a convex function, and this convex function is
decreasing on $\mathbb{R}_+$ for $b < 1$ and increasing for $b > 1$. Therefore, we have to select
$b > 1$ for $\psi_1$ and $b < 1$ for $\psi_2$ to get suitable choices in (11.29). Interestingly,
these choices correspond to Lemma 11.2 with power variance parameters $p = 2 - b$,
i.e., they provide us with Bregman divergences from Tweedie's distributions.
However, (11.30) is more general, because it allows us to select any $b \in \mathbb{R}$,
whereas for power variance parameters $p \in (0, 1)$ there do not exist any Tweedie's
distributions, see Theorem 2.18.
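The link between the functions (11.30) and Tweedie's unit deviances can be checked directly. The following Python sketch (illustrative function names) implements $\psi^{(b)}$, its derivative and the resulting Bregman divergence $D_\psi(y, m) = \psi(y) - \psi(m) - \psi'(m)(y - m)$, and compares it with closed forms for $b = 2$ (Gaussian, $p = 0$), $b = 0$ (gamma, $p = 2$) and $b = -1$ (inverse Gaussian, $p = 3$). In each case the Bregman divergence equals one half of the familiar unit deviance.

```python
import math


def psi(y, b):
    """Function psi^(b) of (11.30) on the positive real line."""
    if b == 0:
        return -1.0 - math.log(y)
    if b == 1:
        return y * math.log(y) - y
    return y ** b / (b * (b - 1.0))


def dpsi(y, b):
    """First derivative of psi^(b)."""
    if b == 1:
        return math.log(y)
    return y ** (b - 1.0) / (b - 1.0)


def bregman(y, m, b):
    """Bregman divergence D_psi(y, m) = psi(y) - psi(m) - psi'(m) * (y - m)."""
    return psi(y, b) - psi(m, b) - dpsi(m, b) * (y - m)


if __name__ == "__main__":
    y, m = 2.0, 3.0
    print(bregman(y, m, 2), (y - m) ** 2 / 2.0)                  # Gaussian case
    print(bregman(y, m, 0), y / m - 1.0 - math.log(y / m))       # gamma case
    print(bregman(y, m, -1), (y - m) ** 2 / (2.0 * m ** 2 * y))  # inverse Gaussian case
```

The sign of `dpsi` also confirms the monotonicity statement: it is positive for $b > 1$ and negative for $b < 1$ on the positive real line.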

In view of Lemma 11.2 and using the fact that unit deviances $\mathfrak{d}_p$ are Bregman
divergences, we select a power variance parameter $p = 2 - b > 1$ for $\psi_2$ and we
select the Gaussian model $p = 2 - b = 0$ for $\psi_1$. This gives us the special choice
for the loss function (11.29) for strictly positive claims $Y > 0$, a.s.,

$$L(y; e^-, q, e^+) = \left[1 + \frac{\eta_1 e^-}{\tau} + \frac{\eta_2\, (e^+)^{1-p}}{(1-\tau)(p-1)}\right] L_\tau(y, q) + \frac{\eta_1}{2}\, \mathfrak{d}_0(y, e^-) + \frac{\eta_2}{2}\, \mathfrak{d}_p(y, e^+), \tag{11.31}$$

with the Gaussian unit deviance $\mathfrak{d}_0(y, e^-) = (y - e^-)^2$ and Tweedie's unit deviance
$\mathfrak{d}_p$ with power variance parameter $p > 1$, see Sect. 11.1.1. The additional constants
$\eta_1, \eta_2 > 0$ are used to balance the contributions of the individual terms to the total
loss. Typically, we choose $p > 2$ for the upper ES reflecting claim size models.
This choice for $\psi_2$ implies that the residuals are weighted inversely proportional
to the corresponding variances $\mu^p$ within Tweedie's family, see (11.5). Using
this loss function (11.31) in (11.28) allows us to estimate the composite triplet
$({\rm ES}_\tau^-(Y|x), F_{Y|x}^{-1}(\tau), {\rm ES}_\tau^+(Y|x))$ with a strictly consistent loss function.
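As a consistency check, the special loss (11.31) for $p = 3$ can be compared with the general form (11.26) under the choices $G(y) = y$ and $\Psi(e^-, e^+) = \eta_1 (e^-)^2/2 + \eta_2/(2 e^+)$, which correspond to $b = 2$ and $b = -1$ in (11.30). The Python sketch below evaluates both expressions for single observations; the function names and test values are illustrative assumptions, and the code is a scalar counterpart to, not a replacement for, the keras loss of the lab section.

```python
def pinball(y, q, tau):
    """Pinball loss (1{y <= q} - tau) * (q - y)."""
    return ((1.0 if y <= q else 0.0) - tau) * (q - y)


def composite_loss_ig(y, e_minus, q, e_plus, tau, eta1=1.0, eta2=1.0):
    """Special loss for p = 3: Gaussian deviance for the lower ES, inverse Gaussian for the upper ES."""
    weight = 1.0 + eta1 * e_minus / tau + eta2 * e_plus ** (-2) / (2.0 * (1.0 - tau))
    d_gauss = (y - e_minus) ** 2
    d_ig = (y - e_plus) ** 2 / (e_plus ** 2 * y)
    return weight * pinball(y, q, tau) + eta1 * d_gauss / 2.0 + eta2 * d_ig / 2.0


def composite_loss_general(y, e_minus, q, e_plus, tau, eta1=1.0, eta2=1.0):
    """General form with G(y) = y and Psi(e-, e+) = eta1 * (e-)^2 / 2 + eta2 / (2 * e+)."""
    def big_psi(a, b):
        return eta1 * a ** 2 / 2.0 + eta2 / (2.0 * b)

    grad = (eta1 * e_minus, -eta2 / (2.0 * e_plus ** 2))   # gradient of Psi
    ind = 1.0 if y <= q else 0.0
    s_minus = (ind - tau) * q - ind * y                    # auxiliary score S^-
    s_plus = s_minus + y                                   # auxiliary score S^+
    first = (y - q) * (tau - ind)
    inner = (grad[0] * (e_minus + s_minus / tau)
             + grad[1] * (e_plus - s_plus / (1.0 - tau)))
    return first + inner - big_psi(e_minus, e_plus) + big_psi(y, y)


if __name__ == "__main__":
    for y in [1.5, 5.0, 12.0]:
        a = composite_loss_ig(y, 2.0, 4.0, 9.0, 0.9, eta1=0.3, eta2=7.0)
        b = composite_loss_general(y, 2.0, 4.0, 9.0, 0.9, eta1=0.3, eta2=7.0)
        print(y, a, b)
```

Both functions return the same value for every admissible input, which confirms that the weight in front of the pinball loss and the two half deviances are exactly what the general theorem prescribes for this choice of $\Psi$.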


11.3.2 Lab: Deep Composite Model Regression 


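The joint elicitability of Theorem 11.11 allows us to directly estimate these
functionals for a fixed quantile level $\tau \in (0, 1)$. In a similar way to quantile
regression we set up a FN network that respects the monotonicity ${\rm ES}_\tau^-(Y|x) \le F_{Y|x}^{-1}(\tau) \le {\rm ES}_\tau^+(Y|x)$.
We set for the regression function in the additive approach
for multi-task learning

$$\begin{aligned} x ~\mapsto~& \left({\rm ES}_\tau^-(Y|x),\; F_{Y|x}^{-1}(\tau),\; {\rm ES}_\tau^+(Y|x)\right)^\top \\ &= \Big(g^{-1}\langle \beta_1, z^{(d:1)}(x)\rangle,\;\; g^{-1}\langle \beta_1, z^{(d:1)}(x)\rangle + g_+^{-1}\langle \beta_2, z^{(d:1)}(x)\rangle, \\ &\qquad g^{-1}\langle \beta_1, z^{(d:1)}(x)\rangle + g_+^{-1}\langle \beta_2, z^{(d:1)}(x)\rangle + g_+^{-1}\langle \beta_3, z^{(d:1)}(x)\rangle\Big)^\top \in \mathcal{A}, \end{aligned} \tag{11.32}$$

for link functions $g$ and $g_+$ with $g_+^{-1} \ge 0$, deep FN network $z^{(d:1)} : \mathbb{R}^{q_0+1} \to \mathbb{R}^{q_d+1}$,
regression parameters $\beta_1, \beta_2, \beta_3 \in \mathbb{R}^{q_d+1}$ and with the action space
$\mathcal{A} = \{(e^-, q, e^+) \in \mathbb{R}_+^3;\; e^- \le q \le e^+\}$ for positive claims. We also remind of
Remark 11.10 for a different way of modeling the monotonicity.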

Fitting this model is similar to the multiple deep quantile regression presented 
in Listings 11.3 and 11.5. There is one important difference though. Namely, we 
do not have multiple outputs and multiple loss functions, but we have a three- 
dimensional output with a single loss function (11.31) simultaneously evaluating all 
three components of the output (11.32). Listing 11.6 gives this loss for the inverse 
Gaussian case p = 3 in (11.31). 


Listing 11.6 Loss function (11.31) for p = 3

    Bregman_IG = function(y_true, y_pred){
       k_mean( (k_maximum(y_true[,1]-y_pred[,2],0)*tau0 +
                k_maximum(y_pred[,2]-y_true[,1],0)*(1-tau0) ) *
          ( 1 + eta1*y_pred[,1]/tau0 + eta2*y_pred[,3]^(-2)/(2*(1-tau0)) ) +
            eta1*(y_true[,1]-y_pred[,1])^2/2 +
            eta2*((y_true[,1]-y_pred[,3])^2/(y_pred[,3]^2*y_true[,1]))/2 )}


We revisit the Swiss accident insurance data of Sect. 11.2.3. We again use a FN
network of depth $d = 3$ with $(q_1, q_2, q_3) = (20, 15, 10)$ neurons, hyperbolic
tangent activation, two-dimensional embedding layers for the categorical features,
exponential output activations for $g^{-1}$ and $g_+^{-1}$, and the additive structure (11.32).
We implement the loss function (11.31) for quantile level $\tau = 90\%$ and with power
variance parameter $p = 3$, see Listing 11.6. This implies that for the upper ES
estimation we scale residuals with $V(\mu) = \mu^3$, see (11.5). We then run an initial
calibration of this FN network. Based on this initial calibration we can calculate
the three loss contributions in (11.31) coming from the composite triplet. Based on
these figures we choose the constants $\eta_1, \eta_2 > 0$ in (11.31) so that all three terms
of the composite triplet contribute equally to the total loss. For the remainder of our
calibration we hold on to these choices of $\eta_1$ and $\eta_2$.
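
The balancing of the three loss contributions can be illustrated with a small base-R sketch. It is not the authors' code: it assumes that the loss splits into three parts in the same way as the three groups of terms in Listing 11.6, and it assumes a simple balancing rule (scale both ES parts to the size of the quantile part). The functions `loss_parts` and `balance_etas` and the toy data are hypothetical.

```r
# Plain-R sketch (no keras): split the inverse Gaussian (p = 3) loss into three
# parts, mirroring the three groups of terms in Listing 11.6. Names are hypothetical.
loss_parts <- function(y, e_minus, q, e_plus, tau) {
  pinball <- pmax(y - q, 0) * tau + pmax(q - y, 0) * (1 - tau)
  c(quantile = mean(pinball),
    lower_es = mean(pinball * e_minus / tau + (y - e_minus)^2 / 2),
    upper_es = mean(pinball * e_plus^(-2) / (2 * (1 - tau)) +
                    (y - e_plus)^2 / (2 * e_plus^2 * y)))
}

# Assumed balancing rule: scale the two ES parts to the size of the quantile part
balance_etas <- function(parts) {
  c(eta1 = unname(parts["quantile"] / parts["lower_es"]),
    eta2 = unname(parts["quantile"] / parts["upper_es"]))
}

set.seed(1)
y     <- rgamma(1000, shape = 2, rate = 1 / 500)   # toy positive claims
tau   <- 0.9
q_hat <- rep(quantile(y, tau, names = FALSE), length(y))
e_low <- rep(mean(y[y <= q_hat[1]]), length(y))    # crude initial lower ES
e_up  <- rep(mean(y[y >  q_hat[1]]), length(y))    # crude initial upper ES

parts <- loss_parts(y, e_low, q_hat, e_up, tau)
etas  <- balance_etas(parts)

# After weighting, the three contributions are of equal size
weighted <- c(parts["quantile"],
              etas["eta1"] * parts["lower_es"],
              etas["eta2"] * parts["upper_es"])
print(weighted)
```

In an actual calibration the three vectors `e_low`, `q_hat` and `e_up` would be replaced by the outputs of the initially calibrated network.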

We calibrate this deep FN architecture to the learning data $\mathcal{L}$, using the strictly
consistent loss function (11.31) for the composite triplet
$(\mathrm{ES}^-_{90\%}(Y|x),\ F^{-1}_{Y|x}(90\%),\ \mathrm{ES}^+_{90\%}(Y|x))$, and to reduce the randomness
in prediction we average over 20 early stopped SGD calibrations with different seeds
(nagging predictor).

Fig. 11.10 Comparison of deep composite model regression

Figure 11.10 shows the estimated lower and upper ES against the corresponding
90%-quantile estimates for 2’000 randomly selected insurance claims $x_t^\dagger$. The
diagonal orange line shows the estimated 90%-quantiles $\hat F^{-1}_{Y|x_t^\dagger}(90\%)$, and the cyan
lines give spline fits to the estimated lower and upper ES. It is clearly visible that
these respect the ordering

$$
\widehat{\mathrm{ES}}^-_{90\%}(Y|x_t^\dagger) \;\le\; \hat F^{-1}_{Y|x_t^\dagger}(90\%) \;\le\; \widehat{\mathrm{ES}}^+_{90\%}(Y|x_t^\dagger),
$$

for fixed features $x_t^\dagger \in \mathcal{X}$.

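The deep quantile regression has been back-tested using the coverage
ratios (11.22). Back-testing the ES is more difficult: the standalone ES is not
elicitable, and the ES can only be back-tested jointly with the corresponding
quantile. The part of the joint identification function that corresponds to the ES is
given by, see (4.2)-(4.3) in Fissler et al. [130],

$$
\hat v^- = \frac{1}{T} \sum_{t=1}^T \left( \widehat{\mathrm{ES}}^-_\tau(Y|x_t^\dagger)
- \frac{Y_t^\dagger \mathbb{1}_{\{Y_t^\dagger \le \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}}
+ \hat F^{-1}_{Y|x_t^\dagger}(\tau) \Big( \tau - \mathbb{1}_{\{Y_t^\dagger \le \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}} \Big)}{\tau} \right),
\qquad (11.33)
$$

and

$$
\hat v^+ = \frac{1}{T} \sum_{t=1}^T \left( \widehat{\mathrm{ES}}^+_\tau(Y|x_t^\dagger)
- \frac{Y_t^\dagger \mathbb{1}_{\{Y_t^\dagger > \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}}
+ \hat F^{-1}_{Y|x_t^\dagger}(\tau) \Big( \mathbb{1}_{\{Y_t^\dagger \le \hat F^{-1}_{Y|x_t^\dagger}(\tau)\}} - \tau \Big)}{1-\tau} \right).
\qquad (11.34)
$$

These (empirical) identifications should be close to zero if the model fits the data.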
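
A minimal base-R sketch of this back-test follows. It assumes the sign convention "estimated ES minus realized tail contribution" and, instead of a fitted network, uses unit exponential responses for which the true quantile and the true lower and upper ES are known in closed form, so that both identifications should be close to zero. The function name `es_identifications` is hypothetical.

```r
# Sketch under assumptions: exponential responses with known quantile and ES
# serve as a stand-in for fitted model outputs. Function names are hypothetical.
es_identifications <- function(y, q, es_low, es_up, tau) {
  below <- as.numeric(y <= q)                 # indicator of Y <= quantile
  c(coverage = mean(below),
    v_minus  = mean(es_low - (y * below + q * (tau - below)) / tau),
    v_plus   = mean(es_up  - (y * (1 - below) + q * (below - tau)) / (1 - tau)))
}

set.seed(100)
tau <- 0.9
n   <- 100000
y   <- rexp(n, rate = 1)

# True values for the unit exponential distribution
q_true      <- -log(1 - tau)                       # tau-quantile
es_up_true  <- q_true + 1                          # memoryless property
es_low_true <- (1 - (1 - tau) * es_up_true) / tau  # from mean = tau*ES- + (1-tau)*ES+

ident <- es_identifications(y, rep(q_true, n), rep(es_low_true, n),
                            rep(es_up_true, n), tau)
print(round(ident, 4))
```

With a misspecified quantile or ES (for instance, `es_up_true + 0.5`) the corresponding identification moves away from zero, which is the signal used in the back-test.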
Note that the latter terms in (11.33)-(11.34) describe the lower and upper
ES also in the case of non-continuous distribution functions because we have the
identity

$$
\mathrm{ES}^-_\tau(Y|x) = \frac{1}{\tau} \left( \mathbb{E}\Big[ Y \mathbb{1}_{\{Y \le F^{-1}_{Y|x}(\tau)\}} \Big| x \Big]
+ F^{-1}_{Y|x}(\tau) \Big( \tau - F_{Y|x}\big(F^{-1}_{Y|x}(\tau)\big) \Big) \right),
\qquad (11.35)
$$

the second term being zero for a continuous distribution $F_{Y|x}$, but it is needed for
non-continuous distribution functions.

We compare the deep composite regression results of this section to the deep
gamma and inverse Gaussian models using a double FN network for dispersion
modeling, see Sect. 11.1.3. This requires calculating the ES in the gamma and the
inverse Gaussian models. This can be done within the EDF, see Landsman-Valdez
[233]. The upper ES in the gamma model $Y \sim \Gamma(\alpha, \beta)$ is given by, see (6.47),

$$
\mathbb{E}\big[ Y \,\big|\, Y > F_Y^{-1}(\tau) \big] = \frac{\alpha}{\beta}\, \frac{1 - \mathcal{G}\big(\alpha+1,\ \beta F_Y^{-1}(\tau)\big)}{1-\tau},
$$

where $\mathcal{G}$ is the scaled incomplete gamma function (6.48) and $F_Y^{-1}(\tau)$ is the
$\tau$-quantile of $\Gamma(\alpha, \beta)$.
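
The gamma formula can be checked numerically. The following base-R sketch assumes the shape/rate parametrization for $\Gamma(\alpha, \beta)$ and uses `pgamma()` with unit rate in the role of the scaled incomplete gamma function; the function names and parameter values are chosen for illustration only.

```r
# Upper ES of a gamma distribution with shape alpha and rate beta (assumed
# parametrization); pgamma() plays the role of the scaled incomplete gamma function.
gamma_upper_es <- function(tau, alpha, beta) {
  q <- qgamma(tau, shape = alpha, rate = beta)
  (alpha / beta) * (1 - pgamma(beta * q, shape = alpha + 1)) / (1 - tau)
}

# Numerical cross-check: integrate y * f(y) over the tail and divide by 1 - tau
gamma_upper_es_numeric <- function(tau, alpha, beta) {
  q <- qgamma(tau, shape = alpha, rate = beta)
  tail_mean <- integrate(function(y) y * dgamma(y, shape = alpha, rate = beta),
                         lower = q, upper = Inf)$value
  tail_mean / (1 - tau)
}

es_closed  <- gamma_upper_es(0.9, alpha = 2, beta = 0.5)
es_numeric <- gamma_upper_es_numeric(0.9, alpha = 2, beta = 0.5)
print(c(closed_form = es_closed, numeric = es_numeric))
```

In a regression setting, `alpha` and `beta` would be policy-specific values obtained from the fitted mean and dispersion networks.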

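Example 4.3 of Landsman-Valdez [233] gives the inverse Gaussian case (2.8)
with $\alpha, \beta > 0$

$$
\mathbb{E}\big[ Y \,\big|\, Y > F_Y^{-1}(\tau) \big]
= \frac{\alpha}{\beta} \left( 1 + \frac{1}{1-\tau}\, \frac{\sqrt{F_Y^{-1}(\tau)}}{\alpha}\, \varphi\big(z_\tau^{(1)}\big) \right)
+ \frac{\alpha}{\beta}\, \frac{e^{2\alpha\beta}}{1-\tau} \left( 2\Phi\big(-z_\tau^{(2)}\big) - \frac{\sqrt{F_Y^{-1}(\tau)}}{\alpha}\, \varphi\big(-z_\tau^{(2)}\big) \right),
$$

where $\varphi$ and $\Phi$ are the standard Gaussian density and distribution, respectively,
$F_Y^{-1}(\tau)$ is the $\tau$-quantile of the inverse Gaussian distribution and

$$
z_\tau^{(1)} = \frac{\alpha}{\sqrt{F_Y^{-1}(\tau)}} \left( \frac{F_Y^{-1}(\tau)}{\alpha/\beta} - 1 \right)
\qquad \text{and} \qquad
z_\tau^{(2)} = \frac{\alpha}{\sqrt{F_Y^{-1}(\tau)}} \left( \frac{F_Y^{-1}(\tau)}{\alpha/\beta} + 1 \right).
$$

This now allows us to calculate the identifications (11.33)-(11.34) in the fitted
deep double networks using the gamma and the inverse Gaussian distributions of
Sect. 11.1.3.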

Table 11.8 shows the out-of-sample coverage ratios and the identifications of the
deep composite regression and the two distributional approaches. These figures
suggest that the gamma model is not competitive; the deep composite model has
the most precise coverage ratio. In terms of the ES identification terms, the deep
composite model and the double network with inverse Gaussian claim sizes are
comparably accurate (out-of-sample) in determining the lower and upper 90% ES.

Table 11.8 Out-of-sample coverage ratios $\hat\tau$ and identifications $\hat v^-$ and $\hat v^+$ of the deep composite
regression model and the deep double networks in the gamma and inverse Gaussian cases

                                      | Coverage ratio | Lower ES identification | Upper ES identification
                                      | τ = 90%        | $\hat v^-$              | $\hat v^+$
Deep composite model                  | 90.12%         | 32.9                    | -143.5
Deep double network gamma             | 93.51%         | 356.6                   | -2’409.0
Deep double network inverse Gaussian  | 92.56%         | -13.0                   | 115.1

Fig. 11.11 Comparison of deep double IG vs. composite model

Finally, we combine the lower and upper ES from the deep composite regression
model according to (11.25). This gives us an estimated mean (under a continuous
distribution function)

$$
\hat\mu(x) = \hat{\mathbb{E}}[Y|x] = \tau\, \widehat{\mathrm{ES}}^-_\tau(Y|x) + (1-\tau)\, \widehat{\mathrm{ES}}^+_\tau(Y|x).
$$
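
The decomposition of the mean into lower and upper ES can be verified on an empirical distribution. The sketch below is an illustration on simulated log-normal claim sizes (an assumption, not the accident insurance data); the sample size is chosen such that $\tau n$ is an integer, so that the sample splits exactly into a lower and an upper part.

```r
# Empirical check of mean = tau * lower ES + (1 - tau) * upper ES.
# tau * n is an integer here, so the sorted sample splits exactly.
set.seed(42)
n   <- 1000
tau <- 0.9
y   <- sort(rlnorm(n, meanlog = 7, sdlog = 1))   # toy claim sizes

k      <- round(tau * n)          # number of claims in the lower part
es_low <- mean(y[1:k])            # empirical lower ES
es_up  <- mean(y[(k + 1):n])      # empirical upper ES

mu_composite <- tau * es_low + (1 - tau) * es_up
print(c(composite = mu_composite, sample_mean = mean(y)))
```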


Figure 11.11 compares these estimates of the deep composite regression model
to the deep double inverse Gaussian model estimates. The black dots show 2’000
randomly selected claims $x_t^\dagger$, and the cyan line gives a spline fit to all out-of-sample
claims in $\mathcal{T}$. The body of the estimates is rather similar in both approaches but the
deep composite approach provides more large estimates; the dotted orange lines
show the maximum estimate from the deep double inverse Gaussian model.

We conclude that in the case where no member of the EDF reflects the properties 
of the data in the tail, the deep composite regression approach presented in this 
section provides an alternative method for mean estimation that allows for separate 
models in the main body and the tail of the data. Fixing the quantile level allows
for a straightforward fitting in one step; this is in contrast to the composite models
where we fix the splicing point. The latter approaches are more difficult to fit,
e.g., using the EM algorithm.
11.4 Model Uncertainty: A Bootstrap Approach


As described in Sect. 4, there are different sources of prediction uncertainty when 
forecasting random variables. There is the irreducible risk that comes from the fact 
that we try to predict random variables. This source of uncertainty is always present, 
even if we know the true data generating mechanism, i.e., it is irreducible. In most 
applied situations we do not know the true data generating mechanism which results 
in additional prediction uncertainty. Within GLMs this source of uncertainty has 
mainly been allocated to parameter estimation uncertainty deriving from the fact that 
we estimate the parameters from a finite sample, we refer to Sects. 3.4 and 11.1.4 
on asymptotic results. In network modeling, the situation is more complicated. 
Firstly, we have seen that there is no best network regression model even if the 
architecture and the hyper-parameters are fully specified. In Fig. 7.18 we have seen 
that in a claim frequency context the different solutions from an early stopped SGD 
fitting can have a coefficient of variation of up to 40% on the individual policy
level; on average these coefficients of variation were around 10%. This has led to
the consideration of network ensembling and the nagging predictor in Sect. 7.4.4.
These considerations have been based on a fixed learning data set $\mathcal{L}$. In this section,
we assume that also the learning data set $\mathcal{L}$ may look different by considering
different realizations of the (randomly generated) observations $Y_i$. To reflect this
source of randomness in outcomes we bootstrap new data from $\mathcal{L}$ by exploring
a non-parametric bootstrap with random drawings with replacement from $\mathcal{L}$, see
Sect. 4.3.1. This will allow us to study the volatility implied in estimation by
considering a different set of observations, i.e., a different sample.
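
The non-parametric bootstrap described above can be sketched in a few lines of base R. The toy data frame `L` and the helper `bootstrap_sample` are hypothetical stand-ins for the learning data and the re-sampling step; each bootstrap sample has the same size as the original data and, on average, contains roughly $1 - e^{-1} \approx 63\%$ of the distinct original observations.

```r
# Non-parametric bootstrap sketch; the learning data L is a toy data frame here
set.seed(2024)
n <- 5000
L <- data.frame(id = seq_len(n), x = rnorm(n),
                Y = rgamma(n, shape = 2, rate = 0.001))

# Draw n rows with replacement from the observed data
bootstrap_sample <- function(data) {
  idx <- sample.int(nrow(data), size = nrow(data), replace = TRUE)
  data[idx, , drop = FALSE]
}

S <- 100
L_star <- lapply(seq_len(S), function(s) bootstrap_sample(L))

# Share of distinct original observations in each bootstrap sample
share_unique <- sapply(L_star, function(d) length(unique(d$id)) / n)
print(round(mean(share_unique), 3))
```

The network would then be fitted once on each element of `L_star`, which is the expensive part not shown here.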

Ideally we would like to generate new observations from the true data generating 
mechanism, but, since this mechanism is not known, we can at best generate data 
from an estimated model. If we rely on a distributional model, we may suffer from 
model error, e.g., in Sect. 11.3 we have seen that it is rather difficult to specify a 
distributional regression model that has the right tail behavior. Therefore, we may 
give preference to a distribution-free approach. Non-parametric bootstrapping is 
such a distribution-free approach, the disadvantage being that we cannot enrich the 
existing observations by new observations, but we can only rearrange the available 
observations. 

We revisit the robust representation learning approach of Sect. 11.1.2 on the
same Swiss accident insurance data as explored in that section. In particular,
we reconsider the deep multi-output models introduced in (11.6) and studied in
Table 11.3 for power variance parameters $p = 2, 2.5, 3$ (and constant dispersion
parameter). We perform exactly the same analysis; here, however, we use
bootstrapped data $\mathcal{L}^*$ for model fitting.

First, we fit 100 times the same deep FN network architecture as in (11.6)
with different seeds (on identical learning data $\mathcal{L}$). From this we calculate the
nagging predictor. Second, we generate 100 different bootstrap samples
$\mathcal{L}^* = \mathcal{L}^{(*s)}$, $1 \le s \le 100$, from $\mathcal{L}$ (having an identical sample size) with random
drawings with replacement, and we fit the same network architecture to these 100
bootstrap samples. We then also average over these 100 predictors obtained from
the different bootstrap samples. Table 11.9 provides the resulting out-of-sample
deviance losses on the test data $\mathcal{T}$. We always hold on to the same test data $\mathcal{T}$
which is disjoint/independent from the learning data $\mathcal{L}$ and the bootstrap samples
$\mathcal{L}^{(*s)}$, $1 \le s \le 100$.

Table 11.9 Out-of-sample losses (gamma loss, power variance case $p = 2.5$ loss (in $10^{-2}$) and
inverse Gaussian (IG) loss (in $10^{-3}$)) and average claim amounts; the losses use unit dispersion
$\varphi = 1$

                                      | Out-of-sample loss on $\mathcal{T}$                           | Average
                                      | $\mathfrak{d}_{p=2}$ | $\mathfrak{d}_{p=2.5}$ | $\mathfrak{d}_{p=3}$ | claim
Null model                            | 4.6979  | 10.2420 | 4.6931 | 1’774
Gamma multi-output of Table 11.3      | 2.0581  | 7.6422  | 3.9146 | 1’745
p = 2.5 multi-output of Table 11.3    | 2.0576  | 7.6407  | 3.9139 | 1’732
IG multi-output of Table 11.3         | 2.0576  | 7.6401  | 3.9134 | 1’705
Gamma multi-output: nagging 100       | 2.0280  | 7.5582  | 3.8864 | 1’752
p = 2.5 multi-output: nagging 100     | 2.0282  | 7.5586  | 3.8865 | 1’739
IG multi-output: nagging 100          | 2.0286  | 7.5592  | 3.8865 | 1’711
Gamma multi-output: bootstrap 100     | 2.0189  | 7.5301  | 3.8745 | 1’803
p = 2.5 multi-output: bootstrap 100   | 2.0191  | 7.5305  | 3.8746 | 1’790
IG multi-output: bootstrap 100        | 2.0194  | 7.5309  | 3.8746 | 1’756

The nagging predictors over 100 seeds are roughly the same as over 20 seeds
(see Table 11.3), which indicates that 20 different network fits suffice, here.
Interestingly, the average bootstrapped version generally improves the nagging
predictors. Thus, here the average bootstrap predictor provides a better balance
among the observations to receive superior predictive power on the test data $\mathcal{T}$,
compare lines ‘nagging 100’ vs. ‘bootstrap 100’ of Table 11.9.

The main purpose of this analysis is to understand the volatility involved in nagging
and bootstrap predictors. We therefore consider the coefficients of variation $\widehat{\mathrm{Vco}}_t$
introduced in (7.43) on individual policies $1 \le t \le T$. Figure 11.12 shows these
coefficients of variation on the individual predictors, i.e., for the individual claims
$x_t^\dagger$ and the individual network calibrations with different seeds. The left-hand side
gives the coefficients of variation based on 100 bootstrap samples, the right-hand
side gives the coefficients of variation of 100 predictors fitted on the same data $\mathcal{L}$
but with different seeds for the SGD algorithm; the y-scale is identical in both plots.
We observe that the coefficients of variation are clearly higher under the bootstrap
approach compared to holding on to the same data $\mathcal{L}$ for SGD fitting with different
seeds. Thus, the nagging predictor averages over the randomness in different seeds
for network calibrations, whereas bootstrapping additionally considers possible
different samples $\mathcal{L}^*$ for model learning. We analyze the difference in magnitudes
in more detail.
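
As an illustration of how such policy-level coefficients of variation can be computed, the following base-R sketch works on a simulated matrix of predictions with one row per policy and one column per fitted network. The simulated predictions (a true mean perturbed by log-normal noise) are an assumption made purely for illustration, and the coefficient of variation is taken as the empirical standard deviation across the predictors divided by their average, which is assumed to match the definition in (7.43).

```r
# Toy ensemble: T_pol policies (rows) and M predictors (columns), e.g. one column
# per seed or per bootstrap sample. The simulated numbers are purely illustrative.
set.seed(7)
T_pol <- 500
M     <- 100
mu_true <- rgamma(T_pol, shape = 2, rate = 0.001)
preds <- sapply(seq_len(M), function(m) mu_true * exp(rnorm(T_pol, sd = 0.15)))

nagging <- rowMeans(preds)                  # ensemble (averaged) predictor
vco     <- apply(preds, 1, sd) / nagging    # coefficient of variation per policy

print(summary(vco))
```

Running the same computation once on predictors from different seeds and once on predictors from different bootstrap samples gives the two sets of coefficients that are compared in the text.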

Figure 11.13 compares the two coefficients of variation for different claim sizes. The
average coefficient of variation for fixed observations $\mathcal{L}$ is 15.9% (cyan columns).
This average coefficient of variation is increased to 24.8% under bootstrapping
(orange columns). The blue line shows the average relative increase for the different
claim sizes (right axis), and the blue dotted line is at a relative increase of 40%. From
Fig. 11.13 we observe that this spread (relative increase) is rather constant across all
claim predictions; we remark that 93.5% of all claim predictions are below 5’000.
Thus, most claims are at the left end of Fig. 11.13.

Fig. 11.12 Coefficients of variation in individual estimators (lhs) bootstrap 100, and (rhs) nagging
100; the y-scale is identical in both plots

Fig. 11.13 Coefficients of variation in individual predictors of the bootstrap and the nagging
approaches (ordered w.r.t. estimated claim sizes)

From this small analysis we conclude that there is substantial model and
estimation uncertainty involved; recall that we fit the deep network architecture to
305’550 individual claims having 7 feature components, which is a comparably large
portfolio. On average, we have a coefficient of variation of 15% implied by SGD
fitting with different seeds, and this coefficient of variation is increased to roughly
25% under additionally bootstrapping the observations. This is considerable, and
it requires that we ensemble these predictors to receive more robust predictions.
The results of Table 11.9 support this re-sampling and ensembling approach as we
receive a better out-of-sample performance.


11.5 LocalGLMnet: An Interpretable Network Architecture 


Network architectures are often criticized for not being (sufficiently) explainable. 
Of course, this is not fully true as we have gained a lot of insight about the 
data examples studied in this book. This criticism of non-explainability has led to 
the development of the post-hoc model-agnostic tools studied in Sect. 7.6. This 
approach has been questioned at many places, and it is not clear whether one 
should try to explain black box models, or whether one should rather try to make 
the models interpretable in the first place, see, e.g., Rudin [322]. In this section 
we take this different approach by working with a network architecture that is 
(more) interpretable. We present the LocalGLMnet proposal of Richman-Wüthrich
[317, 318]. This approach allows for interpreting the results, and it allows for 
variable selection either using an empirical Wald test or LASSO regularization. 

There are different other proposals that try to achieve similar explainability in 
specific network architectures. There is the explainable neural network of Vaughan 
et al. [367] and the neural additive model of Agarwal et al. [3]. These proposals 
rely on parallel networks considering one single variable at a time. Of course, 
this limits their performance because of a missing interaction potential. This has 
been improved in the Combined Actuarial eXplainable Neural Network (CAXNN) 
approach of Richman [314], which requires a manual specification of parallel 
networks for potential interactions. The LocalGLMnet, proposed in this section, 
does not require any manual engineering, and it still possesses the universal 
approximation property. 


11.5.1 Definition of the LocalGLMnet 


The starting point of the LocalGLMnet is a classical GLM. Choose a strictly monotone
and smooth link function $g$. A GLM is obtained by considering the regression
function

$$
x \mapsto g(\mu(x)) = \beta_0 + \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^q \beta_j x_j, \qquad (11.36)
$$

for features $x \in \mathcal{X} \subset \mathbb{R}^q$, intercept $\beta_0 \in \mathbb{R}$ and regression parameter
$\beta \in \mathbb{R}^q$. Compared to (5.5) we change the notation in this section by excluding the
intercept component from the feature $x = (x_1, \ldots, x_q)^\top$ because this will be
more convenient for the LocalGLMnet proposal. The beauty of this GLM regression
function is that we obtain a linear function after applying the link function $g$. This
linear function is considered to be explainable as we can precisely quantify how
much the expected response will change by slightly changing one of the feature
components $x_j$. In particular, this holds true for the log-link which leads to a
multiplicative structure in the expected response.
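
The multiplicative structure under the log-link can be made concrete with a toy example. The intercept, the regression parameters and the feature names below are hypothetical; the point is only that increasing one feature component by one unit multiplies the expected response by $e^{\beta_j}$, irrespective of the other components.

```r
# Toy GLM with log-link: hypothetical intercept and regression parameters
beta0 <- -2.5
beta  <- c(age = 0.30, power = -0.10)

mu <- function(x) exp(beta0 + sum(beta * x))   # inverse of the log-link

x      <- c(age = 0.5, power = 1.2)
x_plus <- x
x_plus["age"] <- x["age"] + 1                  # increase one component by one unit

ratio <- mu(x_plus) / mu(x)                    # multiplicative effect on the mean
print(c(ratio = ratio, exp_beta = unname(exp(beta["age"]))))
```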

The idea is to hold on to this additive structure (11.36) as far as possible, still 
trying to benefit from the universal approximation property of network architectures. 
Richman-Wüthrich [317] propose the following regression structure.


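Definition 11.12 (LocalGLMnet) Choose a FN network architecture
$z^{(d:1)}: \mathbb{R}^q \to \mathbb{R}^q$ of depth $d \in \mathbb{N}$ with equal input and output dimensions to model
the regression attention

$$
\beta: \mathbb{R}^q \to \mathbb{R}^q, \qquad
x \mapsto \beta(x) = z^{(d:1)}(x) = \big( z^{(d)} \circ \cdots \circ z^{(1)} \big)(x).
$$

The LocalGLMnet is defined by the generalized additive decomposition

$$
x \mapsto g(\mu(x)) = \beta_0 + \langle \beta(x), x \rangle = \beta_0 + \sum_{j=1}^q \beta_j(x) x_j,
$$

for a strictly monotone and smooth link function $g$.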
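
The structure of this definition can be sketched without any deep learning library. The base-R code below is a toy forward pass with hypothetical dimensions and randomly drawn weights (none of it is taken from the book's listings): a small tanh network produces the regression attention $\beta(x)$, and the link-scale predictor is the scalar product of $\beta(x)$ and $x$ plus an intercept. If all weights of the last layer are set to zero and its intercepts to given GLM parameters, the attention becomes constant and the LocalGLMnet collapses to that GLM.

```r
# Base-R sketch of the LocalGLMnet forward pass (no keras); names are hypothetical
set.seed(11)
q     <- 4
sizes <- c(q, 6, 5, q)                 # input dim, two hidden layers, output dim q

# Random weights; the last layer is overwritten below for the GLM special case
layers <- lapply(seq_len(length(sizes) - 1), function(m) {
  list(W = matrix(rnorm(sizes[m] * sizes[m + 1], sd = 0.3), sizes[m], sizes[m + 1]),
       b = rnorm(sizes[m + 1], sd = 0.1))
})

# Regression attention beta(x): tanh hidden layers, linear output layer
attention <- function(X, layers) {
  Z <- X
  for (m in seq_along(layers)) {
    Z <- sweep(Z %*% layers[[m]]$W, 2, layers[[m]]$b, "+")
    if (m < length(layers)) Z <- tanh(Z)
  }
  Z
}

# LocalGLMnet on the link scale: beta0 + <beta(x), x>, evaluated row by row
local_glm_link <- function(X, layers, beta0) {
  beta0 + rowSums(attention(X, layers) * X)
}

X <- matrix(rnorm(10 * q), ncol = q)

# GLM special case: zero weights in the last layer, intercepts equal to GLM parameters
beta_glm   <- c(0.5, -0.2, 0.1, 0.0)
layers_glm <- layers
layers_glm[[3]]$W[] <- 0
layers_glm[[3]]$b   <- beta_glm

eta_local <- local_glm_link(X, layers_glm, beta0 = -1)
eta_glm   <- as.numeric(-1 + X %*% beta_glm)
print(max(abs(eta_local - eta_glm)))
```

With the randomly drawn `layers` instead of `layers_glm`, the attention varies with `X`, which is the feature-dependent case.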


This architecture is called LocalGLMnet because locally, around a given feature
value $x$, it can be understood as a GLM, provided that $\beta(x)$ does not change too
much in the environment of $x$. In the GLM context $\beta$ is called regression parameter,
and in the LocalGLMnet context $\beta(x)$ is called regression attention because the
components $\beta_j(x)$ determine how much attention there should be given to a specific
value $x_j$. We highlight this in the following discussion. Select one component
$1 \le j \le q$ and study the individual term

$$
x \mapsto \beta_j(x) x_j. \qquad (11.37)
$$


(1) If $\beta_j(x) \equiv 0$, we should drop the term $\beta_j(x) x_j$ from the regression function.

(2) If $\beta_j(x) \equiv \beta_j\ (\neq 0)$ is not feature dependent (and different from zero), we
receive a GLM term in $x_j$ with regression parameter $\beta_j$.

(3) Property $\beta_j(x) = \beta_j(x_j)$ implies that we have a term $\beta_j(x_j) x_j$ that does not
interact with any other term $x_{j'}$, $j' \neq j$.

(4) Sensitivities of $\beta_j(x)$ in the components of $x$ can be obtained by the gradient

$$
\nabla_x \beta_j(x) = \left( \frac{\partial}{\partial x_1} \beta_j(x), \ldots, \frac{\partial}{\partial x_q} \beta_j(x) \right)^\top \in \mathbb{R}^q. \qquad (11.38)
$$

The $j$-th component of $\nabla_x \beta_j(x)$ determines the (non-)linearity in term $x_j$, the
components different from $j$ describe the interactions of term $x_j$ with the other
components.

(5) These interpretations need some care because we do not have identifiability. For
the special regression attention $\beta_j(x) = x_{j'}/x_j$ we have

$$
\beta_j(x) x_j = x_{j'}. \qquad (11.39)
$$

Therefore, we talk about terms in items (1)-(4), e.g., item (1) means that the
term $\beta_j(x) x_j$ can be dropped, however, the feature component $x_j$ may still
play a significant role in some of the regression attentions $\beta_{j'}(x)$, $j' \neq j$.
In practical applications we have not experienced identifiability issue (11.39).
Having already the linear terms in the LocalGLMnet regression structure
and starting the SGD fitting in the GLM gives already quite pre-determined
regression functions, and the LocalGLMnet is built around this initialization,
hardly falling into a completely different model (11.39).

(6) The LocalGLMnet architecture has the universal approximation property
discussed in Sect. 7.2.2, because networks can approximate any continuous
function arbitrarily well on a compact support for sufficiently large networks.
We can then select one component, say, $x_1$ and let $\beta_1(x) = z_1^{(d:1)}(x)$
approximate a given continuous function $f(x)/x_1$, i.e., $f(x) \approx \beta_1(x) x_1$
arbitrarily well on the compact support.
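
The role of the gradient in item (4) can be illustrated on a toy regression attention with a known form. In the sketch below, the function `beta1` is hypothetical and central finite differences are used as a simple stand-in for the automatic differentiation that a network library would provide: a non-zero first component signals non-linearity of the term in $x_1$, a non-zero second component an interaction with $x_2$, and a zero third component the absence of an interaction with $x_3$.

```r
# Toy regression attention (hypothetical): beta_1 depends on x_1 and x_2 only
beta1 <- function(x) 0.5 + 0.2 * x[1]^2 + 0.3 * x[2]

# Central finite differences as a stand-in for automatic differentiation
num_gradient <- function(f, x, h = 1e-5) {
  sapply(seq_along(x), function(k) {
    e <- replace(numeric(length(x)), k, h)   # step of size h in direction k
    (f(x + e) - f(x - e)) / (2 * h)
  })
}

x0   <- c(1.0, -0.5, 2.0)
grad <- num_gradient(beta1, x0)
print(round(grad, 6))
```

For a fitted LocalGLMnet one would evaluate such gradients over many feature values and inspect, e.g., spline fits of each gradient component.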


11.5.2 Variable Selection in LocalGLMnets 


The LocalGLMnet allows for variable selection through the regression attentions
$\beta_j(x)$. Roughly speaking, if the estimated regression attentions $\hat\beta_j(x) \approx 0$, then the
term $\beta_j(x) x_j$ can be dropped. We can also explore whether the entire variable $x_j$
should be dropped (not only the corresponding term $\beta_j(x) x_j$). For this, we have to
refit the LocalGLMnet excluding the feature component $x_j$. If the out-of-sample
performance on validation data does not change, then $x_j$ also does not play an
important role in any other regression attention $\beta_{j'}(x)$, $j' \neq j$, and it should be
completely dropped from the model.

In GLMs we can either use the Wald test or the LRT to test a null hypothesis
$H_0: \beta_j = 0$, see Sect. 5.3. We explore a similar idea in this section, however, empirically.




We therefore first need to ensure that all feature components live on the same scale. 
We consider standardization with the empirical mean and the empirical standard 
deviation, see (7.30), and from now on we assume that all feature components are 
centered and have unit variance. Then, the main problem is to determine whether an
estimated regression attention $\hat\beta_j(x)$ is significantly different from 0 or not.

We therefore extend the features $x^+ = (x_1, \ldots, x_q, x_{q+1})^\top \in \mathbb{R}^{q+1}$ by an
additional independent and purely random component $x_{q+1}$ that is also standardized.
Since this additional component is independent of all other components it cannot
have any predictive power for the response under consideration, thus, fitting this
extended model should result in a regression attention $\hat\beta_{q+1}(x^+) \approx 0$. The estimate
will not be exactly zero, because there is noise involved, and the magnitude of this
fluctuation will determine the rejection/acceptance region of the null hypothesis of
not being significant.

We fit the LocalGLMnet to the learning data $\mathcal{L}$ with features $x_i^+ \in \mathbb{R}^{q+1}$
extended by the standardized i.i.d. component $x_{i,q+1}$ being independent of $(Y_i, x_i)$.
This gives us the estimated regression attentions $\hat\beta_1(x_i^+), \ldots, \hat\beta_q(x_i^+), \hat\beta_{q+1}(x_i^+)$.
We compute the empirical mean and standard deviation of the attention weight of
the additional component $x_{q+1}$

$$
\bar b_{q+1} = \frac{1}{n} \sum_{i=1}^n \hat\beta_{q+1}(x_i^+)
\qquad \text{and} \qquad
\hat s_{q+1} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n \Big( \hat\beta_{q+1}(x_i^+) - \bar b_{q+1} \Big)^2 }.
\qquad (11.40)
$$

We expect approximate centering $\bar b_{q+1} \approx 0$ because this additional component $x_{q+1}$
does not enter the true regression function, and the empirical standard deviation $\hat s_{q+1}$
quantifies the expected fluctuation around zero of insignificant components.

We can now test the null hypothesis $H_0: \beta_j(x) = 0$ of component $j$ on
significance level $\alpha \in (0, 1/2)$. We define the centered interval

$$
I_\alpha = \Big[ \Phi^{-1}(\alpha/2)\, \hat s_{q+1},\ \Phi^{-1}(1-\alpha/2)\, \hat s_{q+1} \Big], \qquad (11.41)
$$

where $\Phi^{-1}(p)$ denotes the standard Gaussian quantile for $p \in (0, 1)$. $H_0$ should be
rejected if the coverage ratio of this centered interval $I_\alpha$ is substantially smaller than
$1 - \alpha$, i.e.,

$$
\frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{\hat\beta_j(x_i^+) \in I_\alpha\}} < 1 - \alpha.
$$

This proposal is designed for continuous feature components, and categorical
variables are discussed in Sect. 11.5.4, below. For $x_{q+1}$ we can choose a standard
Gaussian distribution, a normalized uniform distribution or we can randomly
permute one of the feature components $x_{i,j}$ across the entire portfolio $1 \le i \le n$.
Usually, the resulting empirical standard deviations $\hat s_{q+1}$ are rather similar.
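
The empirical test described above can be sketched in base R. Since no fitted network is available here, the matrix `beta_hat` of estimated regression attentions is simulated (an assumption made for illustration): two columns mimic relevant terms, one an irrelevant term, and the last one the purely random control component. The tolerance used to decide what "substantially smaller" means is a hypothetical choice.

```r
# Simulated attention values standing in for a fitted LocalGLMnet (assumption):
# columns 1-2 mimic relevant terms, column 3 an irrelevant one, column 4 the
# purely random control component x_{q+1}.
set.seed(3)
n <- 10000
beta_hat <- cbind(strong  = rnorm(n, mean = 0.40, sd = 0.10),
                  varying = rnorm(n, mean = 0.00, sd = 0.25),
                  noise   = rnorm(n, mean = 0.00, sd = 0.05),
                  control = rnorm(n, mean = 0.00, sd = 0.05))

alpha   <- 0.001
b_bar   <- mean(beta_hat[, "control"])     # expected to be close to zero
s_hat   <- sd(beta_hat[, "control"])       # fluctuation of an irrelevant term
I_alpha <- qnorm(c(alpha / 2, 1 - alpha / 2)) * s_hat

# Coverage ratio of I_alpha per component; values well below 1 - alpha reject H0
coverage <- colMeans(beta_hat >= I_alpha[1] & beta_hat <= I_alpha[2])
reject   <- coverage < 1 - alpha - 0.01    # hypothetical tolerance for "substantially"

print(round(coverage, 3))
print(reject)
```

Note that a term can be rejected either because its attention is shifted away from zero (column `strong`) or because it fluctuates much more than the control component (column `varying`).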


11.5.3 Lab: LocalGLMnet for Claim Frequency Modeling 


We revisit the French MTPL data example. We compare the LocalGLMnet approach 
to the deep FN network considered in Sect. 7.3.2, and we benchmark with the results 
of Table 7.3; we benchmark with the crudest FN network from above because, at 
the current stage, we need one-hot encoding for the LocalGLMnet approach. The 
analysis in this section is the same as in Richman-Wüthrich [317].

The French MTPL data has 6 continuous feature components (we treat Area as 
a continuous variable), 1 binary component and 2 categorical components. We pre- 
process the continuous and binary variables to centering and unit variance using 
standardization (7.30). This will allow us to do variable selection as presented 
in (11.41). The categorical variables with more than two levels are more difficult. 
In a first attempt we use one-hot encoding for the categorical variables. We prefer 
one-hot encoding over dummy coding because this ensures that for all levels there
is a component $x_j$ with $x_j \neq 0$. This is important because the terms $\beta_j(x) x_j$ are
equal to zero for the reference level in dummy coding (since $x_j = 0$). This does
not allow us to study interactions with other variables for the term corresponding to 
the reference level. Remark that one-hot encoding and dummy coding do not lead 
to centering and unit variance. 

This feature pre-processing gives us a feature vector $x \in \mathbb{R}^q$ of dimension
$q = 40$. For variable selection of the continuous and binary components we extend
the feature $x$ by two additional independent components $x_{q+1}$ and $x_{q+2}$. We select
two components to explore whether the particular distributional choice has some
influence on the choice of the acceptance/rejection interval $I_\alpha$ in (11.41). We choose
for policies $1 \le i \le n$

$$
x_{i,q+1} \overset{\text{i.i.d.}}{\sim} \text{Uniform}\big[-\sqrt{3}, \sqrt{3}\big]
\qquad \text{and} \qquad
x_{i,q+2} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1),
$$

these two sets of variables being mutually independent, and being independent
from all other variables. We define the extended features
$x_i^+ = (x_{i,1}, \ldots, x_{i,q}, x_{i,q+1}, x_{i,q+2})^\top \in \mathbb{R}^{q_0}$ with $q_0 = q + 2$, and we consider the
LocalGLMnet regression function

$$
x^+ \mapsto \log(\mu(x^+)) = \beta_0 + \sum_{j=1}^{q_0} \beta_j(x^+) x_j.
$$

We choose the log-link for Poisson claim frequency modeling. The time exposure
$v > 0$ can either be integrated as a weight to the EDF or as an offset on the canonical
scale resulting in the same Poisson model, see Sect. 5.2.3.
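
The two additional control components introduced above can be generated as follows. This is a small base-R sketch in which the design matrix `X` is simulated toy data standing in for the pre-processed French MTPL features; the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ is used because it has mean zero and variance $(2\sqrt{3})^2/12 = 1$.

```r
# Two purely random control components with mean zero and unit variance
set.seed(5)
n <- 10000
x_unif  <- runif(n, min = -sqrt(3), max = sqrt(3))   # Uniform[-sqrt(3), sqrt(3)]
x_gauss <- rnorm(n)                                   # N(0, 1)

# X stands in for the pre-processed design matrix with q = 40 columns (toy data)
X      <- matrix(rnorm(n * 40), ncol = 40)
X_plus <- cbind(X, x_unif, x_gauss)                   # extended features, q0 = 42

print(c(var_unif = var(x_unif), var_gauss = var(x_gauss), q0 = ncol(X_plus)))
```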


Listing 11.7 LocalGLMnet architecture


 1  Design = layer_input(shape = c(42), dtype = 'float32', name = 'Design')
 2  Vol = layer_input(shape = c(1), dtype = 'float32', name = 'Vol')
 3  # regression attention beta(x): FN network with linear output of dimension 42
 4  Attention = Design %>%
 5     layer_dense(units=20, activation='tanh', name='FNLayer1') %>%
 6     layer_dense(units=15, activation='tanh', name='FNLayer2') %>%
 7     layer_dense(units=10, activation='tanh', name='FNLayer3') %>%
 8     layer_dense(units=42, activation='linear', name='Attention')
 9  # scalar product <beta(x), x>, followed by intercept, scaling and exponential
10  LocalGLM = list(Design, Attention) %>% layer_dot(name='LocalGLM', axes=1) %>%
11     layer_dense(units=1, activation='exponential', name='Balance')
12  # multiply the expected frequency by the exposure
13  Response = list(LocalGLM, Vol) %>% layer_multiply(name='Multiply')
14  #
15  keras_model(inputs = c(Design, Vol), outputs = c(Response))


We are now ready to define the LocalGLMnet architecture. We choose a network
$z^{(d:1)}: \mathbb{R}^{q_0} \to \mathbb{R}^{q_0}$ of depth $d = 4$ with $(q_1, q_2, q_3, q_4) = (20, 15, 10, 42)$
neurons. The R code is given in Listing 11.7. We note that this is not much more
involved than a plain-vanilla FN network. Slightly special in this implementation is
the integration of the intercept $\beta_0$ on line 11. Naturally, we would like to add this
intercept, however, there is no simple code for doing this. For that reason, we model
the additive decomposition by

$$
x^+ \mapsto \log(\mu(x^+)) = \alpha_0 + \alpha_1 \sum_{j=1}^{q_0} \beta_j(x^+) x_j,
$$

with real-valued parameters $\alpha_0$ and $\alpha_1$ being estimated on line 11 of Listing 11.7.
Thus, in this implementation the regression attentions are obtained by $\alpha_1 \beta_j(x^+)$.
Of course, there are also other ways of implementing this. This LocalGLMnet
architecture has 1’799 network weights to be fitted.
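
The number of network weights can be reproduced by simple counting: every dense layer with $n_{in}$ inputs and $n_{out}$ outputs has $n_{in} n_{out} + n_{out}$ weights, and the final layer on line 11 adds the two parameters $\alpha_0$ and $\alpha_1$. The helper functions below are hypothetical and only carry out this arithmetic.

```r
# Number of weights (including intercepts) of a dense layer
dense_params <- function(n_in, n_out) n_in * n_out + n_out

# LocalGLMnet of Listing 11.7 for input dimension q0: three hidden layers,
# attention layer of size q0, and the two parameters of the 'Balance' layer
local_glm_params <- function(q0, hidden = c(20, 15, 10)) {
  dims <- c(q0, hidden, q0)
  sum(dense_params(head(dims, -1), tail(dims, -1))) + 2
}

local_glm_params(42)   # 860 + 315 + 160 + 462 + 2 = 1799
```

The same function applied to a smaller input dimension gives the parameter count of a LocalGLMnet on a reduced feature set.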

We fit this LocalGLMnet using a training to validation data split of 8 : 2 and a batch
size of 5’000. We initialize the gradient descent algorithm such that we exactly start
in the GLM with $\beta_j(x^+) \equiv \hat\beta_j^{\text{MLE}}$. For this we set all weights in the last layer
on line 8 of Listing 11.7 to zero, $w_{l,j}^{(d)} = 0$, and the corresponding intercepts to
the MLEs of the GLM, i.e., $w_{0,j}^{(d)} = \hat\beta_j^{\text{MLE}}$. This gives us the GLM initialization
$\sum_{j=1}^{q_0} \hat\beta_j^{\text{MLE}} x_j$ on line 10 of Listing 11.7. Moreover, on line 11 of that listing, we
initialize $\alpha_1 = 1$ and $\alpha_0 = \hat\beta_0^{\text{MLE}}$. This implies that the gradient descent algorithm
starts in the MLE estimated GLM. The SGD fitting turns out to be faster than in
the plain-vanilla FN case, probably because we start in the GLM having already
the reasonable linear terms $x_j$ in the model, and we only need to find the regression
attentions $\beta_j(x^+)$ around these linear terms. The results are presented on the second
last line of Table 11.10. The out-of-sample results are slightly worse than in the
plain-vanilla FN case. There are many reasons for that, for instance, many levels in
one-hot encoding may lead to more potential for over-fitting, and hence to an earlier
stopping, here. The same applies if we add too many purely random components
$x_{q+l}$, $l \ge 1$. Since the balance property will not hold, in general, we apply the bias
regularization step (7.33) to adjust $\alpha_0$ and $\alpha_1$; the results are presented on the last
line of Table 11.10. In Remark 3.1 of Richman-Wüthrich [317] a more sophisticated
balance property correction is presented. Our goal now is to analyze this solution.

Table 11.10 Run times, number of parameters, in-sample and out-of-sample deviance losses
(units are in $10^{-2}$) and in-sample average frequency of the Poisson regressions, see also Table 7.3

                                           | Run  | #      | In-sample             | Out-of-sample         | Aver.
                                           | time | param. | loss on $\mathcal{L}$ | loss on $\mathcal{T}$ | freq.
Poisson null                               | -    | 1      | 25.213                | 25.445                | 7.36%
Poisson GLM3                               | 15s  | 50     | 24.084                | 24.102                | 7.36%
One-hot FN $(q_1, q_2, q_3) = (20, 15, 10)$ | 51s  | 1’306  | 23.757                | 23.885                | 6.96%
LocalGLMnet on $x^+$                       | 20s  | 1’799  | 23.728                | 23.945                | 7.46%
LocalGLMnet on $x^+$, bias regularized     | -    | -      | 23.727                | 23.943                | 7.36%


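Listing 11.8 Extracting the regression attentions from the LocalGLMnet architecture


# sub-model that returns the output of the layer named 'Attention'
zz <- keras_model(inputs=model$input,
                  outputs=get_layer(model, 'Attention')$output)
beta <- data.frame(zz %>% predict(list(Xlearn, Vlearn)))
# weight alpha_1 of the 'Balance' layer (9th entry of the weight list)
alpha1 <- as.numeric(get_weights(model)[[9]])
# regression attentions alpha_1 * beta_j(x)
beta <- beta * alpha1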


We start by analyzing the two additional components $x_{i,q+1}$ and $x_{i,q+2}$ being
uniformly and Gaussian distributed, respectively. Listing 11.8 shows how to extract
the estimated regression attentions $\hat\beta_j(x_i^+)$. We calculate the means and standard
deviations of the estimated regression attentions of the two additional components

$$
\bar b_{q+1} = 0.0042 \qquad \text{and} \qquad \bar b_{q+2} = 0.0213,
$$

and

$$
\hat s_{q+1} = 0.0516 \qquad \text{and} \qquad \hat s_{q+2} = 0.0482.
$$

From these numbers we see that the regression attentions $\hat\beta_{q+2}(x_i^+)$ are slightly
biased, whereas $\hat\beta_{q+1}(x_i^+)$ are fairly centered compared to the magnitudes of the
standard deviations. If we select a significance level of $\alpha = 0.1\%$, we receive a
two-sided standard normal quantile of $|\Phi^{-1}(\alpha/2)| = 3.29$. This provides us for
interval (11.41) with

$$
I_\alpha = \Big[ \Phi^{-1}(\alpha/2)\, \hat s_{q+1},\ \Phi^{-1}(1-\alpha/2)\, \hat s_{q+1} \Big] = [-0.17, 0.17].
$$
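
The numbers of this interval can be reproduced directly from the reported standard deviation of the uniformly distributed control component; the short base-R check below uses only the figures stated in the text.

```r
alpha <- 0.001      # significance level 0.1%
s_hat <- 0.0516     # empirical standard deviation reported for the uniform component

z       <- qnorm(1 - alpha / 2)        # two-sided standard normal quantile
I_alpha <- c(-z, z) * s_hat

round(z, 2)         # 3.29
round(I_alpha, 2)   # -0.17  0.17
```

Using the standard deviation 0.0482 of the Gaussian component instead gives bounds of roughly ±0.16, so the distributional choice of the control component has little influence here.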
Fig. 11.14 Estimated regression attentions $\hat\beta_j(x)$ of the continuous and binary feature
components Area, BonusMalus, log-Density, DrivAge, VehAge, VehGas, VehPower and the
two random features $x_{i,q+1}$ and $x_{i,q+2}$ of 2’000 randomly selected policies $x_i^+$; the orange area
shows the interval $I_\alpha$ for dropping term $\beta_j(x) x_j$ on significance level $\alpha = 0.1\%$


Figure 11.14 shows the estimated regression attentions $\hat\beta_j(x_i^+)$ of the continuous
and binary feature components for 2’000 randomly selected policies $x_i^+$, and the
orange area shows the acceptance region $I_\alpha$ on significance level $\alpha = 0.1\%$.
Focusing on the figures of the two additional variables $x_{i,q+1}$ and $x_{i,q+2}$, Fig. 11.14
(bottom, middle and right), we observe that the estimated regression attentions are
mostly within the confidence bounds of $I_\alpha$. This says that we should drop these
two terms (of course, this is clear since we have set the bounds according to these
regression attentions). Focusing on the other variables, we question the inclusion
of the term VehPower as it seems concentrated within $I_\alpha$, and hence we cannot
reject the null hypothesis $H_0: \beta_{\text{VehPower}}(x) = 0$. Moreover, the inclusion of the
term Area needs further exploration.
Table 11.11 Run times, number of parameters, in-sample and out-of-sample deviance losses
(units are in $10^{-2}$) and in-sample average frequency of the Poisson regressions, see also Table 7.3

                                           | Run  | #      | In-sample             | Out-of-sample         | Aver.
                                           | time | param. | loss on $\mathcal{L}$ | loss on $\mathcal{T}$ | freq.
Poisson null                               | -    | 1      | 25.213                | 25.445                | 7.36%
Poisson GLM3                               | 15s  | 50     | 24.084                | 24.102                | 7.36%
One-hot FN $(q_1, q_2, q_3) = (20, 15, 10)$ | 51s  | 1’306  | 23.757                | 23.885                | 6.96%
LocalGLMnet on $x^+$                       | 20s  | 1’799  | 23.728                | 23.945                | 7.46%
LocalGLMnet on $x^+$, bias regularized     | -    | -      | 23.727                | 23.943                | 7.36%
LocalGLMnet on $x^-$                       | 20s  | 1’675  | 23.715                | 23.912                | 7.30%
LocalGLMnet on $x^-$, bias regularized     | -    | -      | 23.714                | 23.911                | 7.36%

We recall that dropping a term $\beta_j(\boldsymbol{x})x_j$ does not necessarily imply that we
have to completely drop $x_j$ because it may still play an important role in one of the
other regression attentions $\beta_{j'}(\boldsymbol{x})$, $j' \neq j$. Therefore, we re-run the whole fitting
procedure, but we drop the purely random feature components $x_{i,q+1}$ and $x_{i,q+2}$,
and we also drop VehPower and Area to see whether we receive a model with a
similar predictive power. This then would imply that we can drop these variables, in
the sense of variable selection similar to the LRT and the Wald test of Sect. 5.3. We
denote the feature where we drop these components by $\boldsymbol{x}^- \in \mathbb{R}^{q-2}$.
We re-fit the LocalGLMnet on the reduced features $\boldsymbol{x}_i^-$, and the results are presented
in Table 11.11. We observe that the loss figures decrease. Indeed, this supports the
null hypothesis of dropping VehPower and Area. The reason for being able to
drop VehPower is that it does not contribute (sufficiently) to explain the systematic
effects in the responses. The reason for being able to drop Area is slightly different:
we have seen that Area and log-Density are highly correlated, see Fig. 13.12
(rhs), and it turns out that it is sufficient to only keep the Density variable (on the
log-scale) in the model.

In a next step, we should analyze the robustness of these results by exploring the
nagging predictor and/or bootstrapping as described in Sect. 11.4. We refrain from
doing so, but we illustrate the LocalGLMnet solution of Table 11.11 in more detail.
Figure 11.15 shows the feature contributions $\hat{\beta}_j(\boldsymbol{x}_i^-)x_{i,j}$ of 2'000 randomly selected
policies on the significant continuous and binary feature components. The magenta
line gives a spline fit, and the more the black dots spread around these splines, the
more interactions we have; for instance, higher bonus-malus levels interact with the
age of driver which explains the scattering of the black dots. On average, frequencies
are increasing in bonus-malus levels and density, decreasing in vehicle age, and for
the driver's age variable it is important to understand the interactions. We observe
that the spline fit for the log-Density is close to a linear function; this reflects
that the regression attentions $\hat{\beta}_j(\boldsymbol{x}_i)$ of log-Density in Fig. 11.14 (top-right) are more or less
constant. This is also confirmed by the marginal plot in Fig. 5.4 (bottom-rhs) which
has motivated the choice of a linear term for the log-Density in model Poisson
GLM1 of Table 5.3.
Fig. 11.16 Importance measure $\mathrm{IM}_j$ of the continuous and binary variables
Using the regression attentions we define an importance measure. We consider
the extended features $\boldsymbol{x}^+$ in the following numerical analysis. We set

$$\mathrm{IM}_j = \frac{1}{n}\sum_{i=1}^n \left|\hat{\beta}_j(\boldsymbol{x}_i^+)\right|,$$

for $1 \le j \le q+2$, and where we aggregate over all policies $1 \le i \le n$.

Figure 11.16 shows the importance measures $\mathrm{IM}_j$ of the continuous and binary variables
$j$. The bars are ordered w.r.t. these importance measures. The graph confirms
our previous conclusion: the least important variables are the two additional purely
random components $x_{i,q+1}$ and $x_{i,q+2}$, followed by Area and VehPower. These
are exactly the components that have been dropped going from the full model $\boldsymbol{x}^+$ to
the reduced model $\boldsymbol{x}^-$.
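The importance measure is a column-wise mean of absolute regression attentions, which the following base R sketch makes explicit. The matrix `beta_hat` is simulated; its column names and magnitudes are hypothetical and only chosen to mimic one important component, one weak component and one purely random component.

```r
# Sketch: importance measure IM_j = (1/n) * sum_i |beta_j(x_i)|.
# 'beta_hat' is a hypothetical n x 3 matrix of regression attentions.
set.seed(1)
n <- 1000
beta_hat <- cbind(
  BonusMalus = rnorm(n, mean = 0.30, sd = 0.10),
  VehPower   = rnorm(n, mean = 0.00, sd = 0.02),
  Random     = rnorm(n, mean = 0.00, sd = 0.01)
)
IM        <- colMeans(abs(beta_hat))      # one importance value per component
IM_sorted <- sort(IM, decreasing = TRUE)  # ordering used for the bar plot
```

With fitted attentions in place of `beta_hat`, `barplot(IM_sorted, horiz = TRUE)` would give a plot of the same type as Fig. 11.16.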

Next, we analyze the interactions by studying the gradients (11.38). Figure 11.17
illustrates spline fits to the components $\partial\hat{\beta}_j(\boldsymbol{x}_i)/\partial x_k$ w.r.t. $x_j$ of the continuous
variables BonusMalus, log-Density, DrivAge and VehAge over all policies
$i = 1,\ldots,n$. The components $\partial\hat{\beta}_j(\boldsymbol{x}_i)/\partial x_j$ show the non-linearity in $x_j$. We
conclude that BonusMalus, DrivAge and VehAge should be non-linear, and
log-Density is linear because $\partial\hat{\beta}_j(\boldsymbol{x}_i)/\partial x_j \approx 0$. The components $\partial\hat{\beta}_j(\boldsymbol{x}_i)/\partial x_k$,
$k \neq j$, determine the interactions. We have the strongest interactions between
BonusMalus and DrivAge, and BonusMalus has interactions with all variables.
On the other hand, the log-Density only interacts with BonusMalus.

The reader will have noticed that we have excluded the categorical components
VehBrand and Region from all model discussions. Firstly, these components are
not standardized to zero mean and unit variance, and, secondly, we cannot study one
level in isolation to be able to decide to keep or drop that variable. I.e., similar to
group LASSO we need to study all levels of each categorical feature component
simultaneously. We do this in the next section, and we conclude with the regression
attentions $\hat{\beta}_j(\boldsymbol{x})$ of the categorical feature components in Fig. 11.18, which seem to
be significantly different from zero (VehBrands B10, B11, and Regions R22,
R43, R82, R93), but which do not allow for variable selection as just described.

Fig. 11.17 Spline fits to the derivatives $\partial\hat{\beta}_j(\boldsymbol{x}_i)/\partial x_k$ w.r.t. $x_j$ of the continuous variables
BonusMalus, log-Density, DrivAge and VehAge over all policies $i = 1,\ldots,n$


Remark 11.13 The bias regularization in Table 11.11 has simply been obtained by
applying an additional MLE step to $\alpha_0$ and $\alpha_1$. Alternatively, we can also define
the new features $\hat{\boldsymbol{z}}_i = (\hat{\alpha}_1\hat{\beta}_1(\boldsymbol{x}_i)x_{i,1},\ldots,\hat{\alpha}_1\hat{\beta}_{q_0}(\boldsymbol{x}_i)x_{i,q_0})^\top \in \mathbb{R}^{q_0}$, and then apply
a proper GLM step to these newly (learned) features $\hat{\boldsymbol{z}}_1,\ldots,\hat{\boldsymbol{z}}_n$. Working with the
canonical link will give us the balance property. This is discussed in more detail in
Remark 3.1 of Richman-Wüthrich [317].
Fig. 11.18 Boxplot of the regression attentions $\hat{\beta}_j(\boldsymbol{x})$ of the categorical feature components
VehBrand and Region; the y-scale is the same as in Fig. 11.15


11.5.4 Variable Selection Through Regularization of the 
LocalGLMnet 


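A natural next step is to introduce regularization on the regression attentions
$\boldsymbol{\beta}(\boldsymbol{x})$; this is the proposal of Richman-Wüthrich [318]. We choose the
LocalGLMnet architecture $\boldsymbol{x} \mapsto \mu(\boldsymbol{x})$ of Definition 11.12 having an intercept
parameter $\beta_0 \in \mathbb{R}$ and the network weights $\boldsymbol{w}$. For fitting, we consider a loss
function $L$ and we add a regularization term to this loss function penalizing large
regression attentions. That is, we aim at solving

$$\underset{\beta_0,\, \boldsymbol{w}}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \mathfrak{R}(\boldsymbol{\beta}(\boldsymbol{x}_i)), \tag{11.42}$$

with a penalty term (regularizer) $\mathfrak{R}(\cdot) \ge 0$. For the penalty term $\mathfrak{R}$ we can choose
different forms, e.g., the elastic net regularizer of Zou-Hastie [409] is obtained by,
see Remark 6.3,

$$\underset{\beta_0,\, \boldsymbol{w}}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \eta\left((1-\alpha)\|\boldsymbol{\beta}(\boldsymbol{x}_i)\|_2^2 + \alpha\|\boldsymbol{\beta}(\boldsymbol{x}_i)\|_1\right), \tag{11.43}$$

for a regularization parameter $\eta \ge 0$ and weight $\alpha \in [0,1]$. For $\alpha = 0$ we receive
ridge regularization, and for $\alpha = 1$ we get LASSO regularization of $\boldsymbol{\beta}(\cdot)$.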
For variable selection of categorical feature components we should rather use the
group LASSO penalization of Yuan-Lin [398], see also (6.5). Assume the features
$\boldsymbol{x}$ have a natural group structure $\boldsymbol{x} = (\boldsymbol{x}_1^\top,\ldots,\boldsymbol{x}_K^\top)^\top \in \mathbb{R}^q$. We consider the
optimization

$$\underset{\beta_0,\, \boldsymbol{w}}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \sum_{k=1}^K \eta_k\|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_2, \tag{11.44}$$

for regularization parameters $\eta_k \ge 0$, and where $\boldsymbol{\beta}_k(\boldsymbol{x})$ collects all components
$\beta_j(\boldsymbol{x})$ of $\boldsymbol{\beta}(\boldsymbol{x})$ that belong to the $k$-th group $\boldsymbol{x}_k$ of $\boldsymbol{x}$. Yuan-Lin [398] propose to
scale the regularization parameters as $\eta_k = \sqrt{q_k}\,\eta \ge 0$, where $q_k$ is the size of group
$k$. Remark that if every group has size one we exactly obtain LASSO regularization.
Solving the optimization problem (11.44) poses some challenges because the
regularizer is not differentiable in zero. In Sect. 6.2.5 we have presented the
generalized projection operator (using the soft-thresholding operator) to solve
the group LASSO regularization within GLMs. However, this proposal will not
work here: the generalized projection operator may help to project the regression
attentions $\boldsymbol{\beta}(\boldsymbol{x}_i)$ back to the constraint set $\mathcal{C}$, but this does not tell us anything
about how to choose the network parameters $\boldsymbol{w}$. In a different setting,
Oelker-Tutz [288] propose to use a differentiable $\epsilon$-approximation
to the terms in (11.44). Choose $\epsilon > 0$ and define for $\boldsymbol{\beta}_k \in \mathbb{R}^{q_k}$

$$\|\boldsymbol{\beta}_k\|_{2,\epsilon} = \sqrt{\|\boldsymbol{\beta}_k\|_2^2 + \epsilon} = \sqrt{\boldsymbol{\beta}_k^\top\boldsymbol{\beta}_k + \epsilon}\ \to\ \|\boldsymbol{\beta}_k\|_2 \qquad \text{as } \epsilon \downarrow 0. \tag{11.45}$$


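This motivates studying the optimization problem for a fixed (small) $\epsilon > 0$

$$\underset{\beta_0,\, \boldsymbol{w}}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n L\left(Y_i, \mu(\boldsymbol{x}_i)\right) + \sum_{k=1}^K \eta_k\|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon}. \tag{11.46}$$

In Fig. 11.19 we plot these $\epsilon$-approximations for $\epsilon \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$.
The plot on the left-hand side gives $\beta \in \mathbb{R} \mapsto \|\beta\|_{2,\epsilon} = \sqrt{\beta^2 + \epsilon} \to |\beta|$ for
$\epsilon \downarrow 0$, and the plot on the right-hand side gives the unit ball

$$B_\epsilon = \left\{\boldsymbol{\beta} = (\beta_1, \beta_2)^\top \in \mathbb{R}^2;\ \|\beta_1\|_{2,\epsilon} + \|\beta_2\|_{2,\epsilon} = 1\right\}.$$

For the last two $\epsilon$ choices there is no visible difference to the $\ell_1$-norm.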
Fig. 11.19 (lhs) Comparison of $|\beta|$ and $\|\beta\|_{2,\epsilon} = \sqrt{\beta^2 + \epsilon}$ for $\beta \in \mathbb{R}$, and (rhs) unit balls $B_\epsilon$ for
$\epsilon \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$ compared to the Manhattan unit ball


The main disadvantage of the $\epsilon$-approximation is that it does not shrink unimportant
components $\beta_j(\boldsymbol{x})$ exactly to zero. But it allows us to identify unimportant (small)
components, which can then be removed manually. As mentioned in Lee et al. [237],
LASSO regularization needs a second model calibration step only fitting the model
on the selected components (and without regularization) to receive an optimal
predictive power and a minimal bias. Thus, we need a second calibration step after
the removal of the unimportant components anyway.
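The behavior of the $\epsilon$-approximation (11.45) can be checked numerically. The following base R sketch uses a hypothetical two-dimensional vector `b`; the helper name `norm_eps` is introduced here for illustration only.

```r
# Sketch: epsilon-approximation ||b||_{2,eps} = sqrt(sum(b^2) + eps) of the Euclidean norm.
norm_eps <- function(b, eps) sqrt(sum(b^2) + eps)

b    <- c(0.3, -0.4)                           # Euclidean norm equals 0.5
eps  <- 10^-(1:5)                              # the values used in Fig. 11.19
vals <- sapply(eps, function(e) norm_eps(b, e))
err  <- vals - sqrt(sum(b^2))                  # approximation error for each eps

# at b = 0 the approximation equals sqrt(eps) rather than 0
at_zero <- sapply(eps, function(e) norm_eps(0, e))
```

The error `err` is positive and decreases with $\epsilon$, while `at_zero` stays strictly positive, which is in line with the remark that the approximation does not shrink components exactly to zero.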


11.5.5 Lab: LASSO Regularization of LocalGLMnet 


We revisit the LocalGLMnet architecture applied to the French MTPL claim frequency
data, see Sect. 11.5.3. The goal is to perform a group LASSO regularization
so that we can also study the importance of the terms coming from the categorical
feature components VehBrand and Region. We first pre-process all feature
components as follows. We apply dummy coding to the categorical variables, and
then we standardize all components to zero mean and unit variance; this includes the
dummy coded components.

In a next step we need to define the natural groups $\boldsymbol{x} = (\boldsymbol{x}_1^\top,\ldots,\boldsymbol{x}_K^\top)^\top \in \mathbb{R}^q$. We
have 7 continuous and binary components which give us dimensions $q_k = 1$ for
$1 \le k \le 7$. VehBrand provides us with a group of size $q_8 = 10$, and Region
gives us a group of size $q_9 = 21$. We set $K = 9$ and $q = \sum_{k=1}^K q_k = 38$. We code
a (sort of) regularization design matrix to encode the $K$ groups and weights $\sqrt{q_k}$
for the $q$ components of $\boldsymbol{x}$. This is done in Listing 11.9 providing us with a matrix
of size $38 \times 9$ and the weights $\sqrt{q_k}$. This regularization design matrix enters the
penalty term on lines 13 and 16 of Listing 11.10 which weights the penalizations
$\|\cdot\|_{2,\epsilon}$.

Listing 11.9 Group LASSO regularization design

group.lasso.grouping <- function(xx){
  pp <- array(0, dim=c(length(xx),sum(xx)))   # one row per group, one column per component
  for (k in 1:length(xx)){
    if (k==1){pp[k,1:xx[k]] <- 1
    }else{
      pp[k,(sum(xx[1:(k-1)])+1):sum(xx[1:k])] <- 1
    }}
  t(pp)                                       # return the q x K indicator matrix
}
#
ww <- group.lasso.grouping(c(rep(1,7),10,21))
etaK <- eta * sqrt(c(rep(1,7),10,21))
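The way the grouping matrix and the weights enter the penalty can be cross-checked outside of keras. The sketch below builds the same kind of $38 \times 9$ indicator matrix in a compact way and evaluates $\sum_k \eta_k \|\boldsymbol{\beta}_k\|_{2,\epsilon}$ with matrix operations; the matrix `beta` of regression attentions is simulated and purely hypothetical, and the value of `eta` is one of the regularization parameters considered in this lab.

```r
# Sketch: group LASSO penalty sum_k eta_k * sqrt(||beta_k||_2^2 + eps) via a grouping matrix.
# Group sizes follow the text (7 single components, VehBrand: 10, Region: 21).
sizes <- c(rep(1, 7), 10, 21)
group <- rep(seq_along(sizes), times = sizes)      # group index of each of the 38 components
ww    <- outer(group, seq_along(sizes), "==") * 1  # 38 x 9 indicator matrix

eta  <- 0.0025
etaK <- eta * sqrt(sizes)                          # weights sqrt(q_k) * eta
eps  <- 1e-5

set.seed(1)
beta <- matrix(rnorm(5 * 38, sd = 0.1), nrow = 5)  # hypothetical attentions of 5 policies

# same order of operations as the penalty layers: square, aggregate, sqrt, weight
penalty <- as.vector(sqrt(beta^2 %*% ww + eps) %*% etaK)

# direct group-by-group evaluation for the first policy as a cross-check
direct <- sum(etaK * sqrt(tapply(beta[1, ]^2, group, sum) + eps))
```

Both evaluations agree for the first policy, which illustrates why a non-trainable dense layer with fixed weights is sufficient to implement the group structure.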


Listing 11.10 LocalGLMnet with group LASSO regularization

Design = layer_input(shape = c(38), dtype = 'float32')
LogVol = layer_input(shape = c(1), dtype = 'float32')
Bias1  = layer_input(shape = c(1), dtype = 'float32')
# regression attentions
Attention = Design %>%
    layer_dense(units=15, activation='tanh') %>%
    layer_dense(units=10, activation='tanh') %>%
    layer_dense(units=38, activation='linear', name='Attention')
# group LASSO penalty (epsilon-approximation)
Penalty = Attention %>%
    layer_lambda(function(x) k_square(x)) %>%
    layer_dense(units=9, activation='linear',
        weights=list(ww), use_bias=FALSE, trainable=FALSE) %>%
    layer_lambda(function(x) k_sqrt(x+epsilon)) %>%
    layer_dense(units=1, activation='linear',
        weights=list(array(etaK, dim=c(9,1))), use_bias=FALSE, trainable=FALSE)
# scalar product of features and attentions
LocalGLM = list(Design, Attention) %>% layer_dot(axes=1)
# intercept
Bias = Bias1 %>%
    layer_dense(units=1, activation='linear', use_bias=FALSE)
# expected response
Response = list(LocalGLM, Bias, LogVol) %>% layer_add() %>%
    layer_lambda(function(x) k_exp(x))
# response and penalty are returned jointly
Output = list(Response, Penalty) %>% layer_concatenate()
#
keras_model(inputs = c(Design, LogVol, Bias1), outputs = c(Output))
The entire group LASSO regularized LocalGLMnet is depicted in Listing 11.10,
showing the regression attentions on lines 5-8, the regularization on lines 10-16,
and the output on line 26 returns the expected response $v_i\mu(\boldsymbol{x}_i)$ and the regularizer
$\sum_{k=1}^K \eta_k\|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon}$; we choose $\epsilon = 10^{-5}$ for our example.


Listing 11.11 Group LASSO regularized Poisson deviance loss

Poisson.reg <- function(y_true, y_pred){k_mean(
    y_pred[,1]-y_true[,1] + y_true[,1]*k_log((y_true[,1]/y_pred[,1]+.00000001))
    + y_pred[,2] )}


Finally, we need to code the loss function (11.42). This is done in Listing 11.11. We
combine the Poisson deviance loss function with the group LASSO $\epsilon$-approximation
$\sum_{k=1}^K \eta_k\|\boldsymbol{\beta}_k(\boldsymbol{x}_i)\|_{2,\epsilon}$, the latter being outputted by Listing 11.10. We fit this
network to the French MTPL data (as above) for regularization parameters $\eta \in
\{0, 0.0025, 0.005\}$. Firstly, we note that the resulting networks are not fully competitive;
this is probably due to the fact that the high-dimensional dummy coding leads
to too much over-fitting potential which leads to a very early stopping in gradient
descent fitting. Thus, this approach may not be useful to directly receive a good
predictive model, but it may be helpful to select the right feature components to
design a good predictive model.

Figure 11.20 gives the importance measures of the estimated regression attentions

$$\mathrm{IM}_j = \frac{1}{n}\sum_{i=1}^n \left|\hat{\beta}_j(\boldsymbol{x}_i)\right|,$$

of all components $1 \le j \le q = 38$. The red color corresponds to regularization
parameter $\eta = 0.005$, red + yellow colors to $\eta = 0.0025$, and red + yellow + green
colors to $\eta = 0$ (no regularization). Figure 11.20 (lhs) shows the results on the
original (standardized) features $\boldsymbol{x}_i$. By far the smallest red + yellow column among
the continuous features is observed for VehPower which confirms the variable
selection of Sect. 11.5.3. Among the categorical variables Region seems more
important (on average) than VehBrand because the red and yellow columns are
generally bigger for Region. All these red and yellow columns of VehBrand and
Region are bigger than the ones of VehPower which supports the inclusion of
the two categorical variables.

Figure 11.20 (rhs) verifies this decision of keeping the categorical variables. For
this latter graph we randomly permute Region across the entire portfolio, and we
run the same group LASSO regularized fitting procedure again on this modified
data. The vertical black line shows the average importance of the permuted Region
variable for $\eta = 0.0025$. We see that only VehPower has a smaller importance
measure, and all other variables dominate the permuted Region variable. This
confirms our conclusions above.
Fig. 11.20 Importance measures $\mathrm{IM}_j$ of the group LASSO regularized LocalGLMnet for variable
selection with different regularization parameters $\eta \in \{0, 0.0025, 0.005\}$: (lhs) original data, and
(rhs) randomly permuted Region labels; the x-scale (importance measure, group LASSO) is the same in both plots


We conclude that the LocalGLMnet architecture with a group LASSO regularization
is helpful for variable selection, and, more generally, the LocalGLMnet
architecture is useful for model interpretation, finding interactions and functional
forms of the features entering the regression function. In examples that have
categorical variables with many levels, the LocalGLMnet approach may not lead
to a regression model that is fully competitive. In this case, the LocalGLMnet can
be used for variable selection, and another network architecture should then be fitted
on the selected variables. Alternatively, we can embed the categorical variables in a
preparatory network step, and then work with these embeddings of the categorical
variables (kept fixed within the LocalGLMnet).
11.6 Selected Applications


11.6.1 Mixture Density Networks 


In Sect. 6.3 we have introduced mixture distributions and we have presented the EM
algorithm for fitting these mixture distributions. The EM algorithm considers two
steps, an expectation step (E-step) and a maximization step (M-step). The E-step is
motivated by (6.34). In this step the posterior distribution of the latent variable $\boldsymbol{Z}$
is determined, given the observation $Y$ and the parameter estimates for the model
parameters $\boldsymbol{\theta}$ and $\boldsymbol{p}$. The M-step (6.35) determines the optimal model parameters
$\boldsymbol{\theta}$ and $\boldsymbol{p}$, based on the observation $Y$ and the posterior distribution of $\boldsymbol{Z}$. Typically,
we explore MLE in the M-step. However, for the EM algorithm to function it is not
important that we really work with the maximum in the M-step, but monotonicity
in (6.38) is sufficient. Thus, if at algorithmic time $t-1$ we have a parameter
estimate $(\hat{\boldsymbol{\theta}}^{(t-1)}, \hat{\boldsymbol{p}}^{(t-1)})$, it suffices that the next estimate $(\hat{\boldsymbol{\theta}}^{(t)}, \hat{\boldsymbol{p}}^{(t)})$ increases the
log-likelihood, without necessarily being the MLE; this latter approach is called
generalized EM (GEM) algorithm. Exactly this point makes it feasible to also use
the EM algorithm in cases where we model the parameters through networks which
are fit using gradient descent (ascent) algorithms. These methods go under the name
of mixture density networks (MDNs).
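The GEM idea can be illustrated on a toy example that is far simpler than an MDN. The following base R sketch fits a two-component Gaussian mixture with known unit variances to simulated data, and in the M-step it only moves half-way towards the maximizer. The simulated data, the starting values and the step size of 0.5 are assumptions made for illustration; since the expected complete log-likelihood is concave in these parameters, the half-step still does not decrease it, and the incomplete log-likelihood remains monotone.

```r
# Sketch: generalized EM (GEM) for a two-component Gaussian mixture with unit variances.
# The M-step is only partially carried out (half-way towards the maximizer).
set.seed(1)
y <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 4, sd = 1))

loglik <- function(p, mu) {
  sum(log(p * dnorm(y, mu[1], 1) + (1 - p) * dnorm(y, mu[2], 1)))
}

p  <- 0.5; mu <- c(-1, 1)          # starting values
ll <- loglik(p, mu)                # trace of the incomplete log-likelihood
for (t in 1:50) {
  # E-step: posterior probability of component 1 for each observation
  a <- p * dnorm(y, mu[1], 1)
  b <- (1 - p) * dnorm(y, mu[2], 1)
  z <- a / (a + b)
  # maximizers of the expected complete log-likelihood
  p_mle  <- mean(z)
  mu_mle <- c(sum(z * y) / sum(z), sum((1 - z) * y) / sum(1 - z))
  # generalized M-step: move only half-way towards these maximizers
  p  <- p  + 0.5 * (p_mle - p)
  mu <- mu + 0.5 * (mu_mle - mu)
  ll <- c(ll, loglik(p, mu))
}
```

The vector `ll` traces the log-likelihood over the iterations. In an MDN the half-step is replaced by a few gradient descent steps on the network weights, where monotonicity is no longer guaranteed and should be traced in the same way.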

MDNs have been introduced by Bishop [35], who explores MDNs on Gaussian
mixtures using SGD and quasi-Newton methods for model fitting. MDNs have
also started to gain more popularity within the actuarial community; recent papers
include Delong et al. [95], Kuo [230] and Al-Mudafer et al. [6], the latter two
considering MDNs for claims reserving.

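We recall the mixture density for a selected member of the EDF. The incomplete
log-likelihood of the data $(Y_i, \boldsymbol{x}_i, v_i)_{1\le i\le n}$ is given by, see (6.24),

$$(\boldsymbol{\theta}, \boldsymbol{\varphi}, \boldsymbol{p}) \mapsto \ell_{\boldsymbol{Y}}(\boldsymbol{\theta}, \boldsymbol{\varphi}, \boldsymbol{p}) = \sum_{i=1}^n \ell_{Y_i}\left(\boldsymbol{\theta}(\boldsymbol{x}_i), \boldsymbol{\varphi}(\boldsymbol{x}_i), \boldsymbol{p}(\boldsymbol{x}_i)\right) = \sum_{i=1}^n \log\left(\sum_{k=1}^K p_k(\boldsymbol{x}_i)\, f_k\!\left(Y_i;\ \theta_k(\boldsymbol{x}_i),\ \frac{v_i}{\varphi_k(\boldsymbol{x}_i)}\right)\right),$$

for canonical parameter $\boldsymbol{\theta} = (\theta_1,\ldots,\theta_K)^\top \in \boldsymbol{\Theta} = \Theta_1 \times \cdots \times \Theta_K$, dispersion
parameter $\boldsymbol{\varphi} = (\varphi_1,\ldots,\varphi_K)^\top \in \mathbb{R}_+^K$, mixture probability $\boldsymbol{p} \in \Delta_K$, and $K$ denotes
the number of mixture components. MDNs model these parameters with networks.
Choose a FN network $\boldsymbol{z}^{(d:1)}: \mathbb{R}^{q+1} \to \{1\} \times \mathbb{R}^{q_d}$ of depth $d$, with input dimension
being equal to the dimension of the features $\boldsymbol{x} \in \mathcal{X} \subset \{1\} \times \mathbb{R}^q$ and output dimension
$q_d + 1$. This gives us the learned representations $\boldsymbol{z}_i = \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)$. These learned
representations are used to model the parameters. For the mixture probability $\boldsymbol{p}$ we
build a logistic categorical GLM, based on $\boldsymbol{z}_i$. For the (canonical) link $\boldsymbol{h}$, we set
linear predictor, see (5.72),

$$\boldsymbol{h}\left(\boldsymbol{p}(\boldsymbol{z}_i)\right) = \left(\langle\boldsymbol{\beta}_1^{p}, \boldsymbol{z}_i\rangle, \ldots, \langle\boldsymbol{\beta}_K^{p}, \boldsymbol{z}_i\rangle\right)^\top \in \mathbb{R}^K, \tag{11.47}$$

with regression parameter $\boldsymbol{\beta}^{p} = ((\boldsymbol{\beta}_1^{p})^\top, \ldots, (\boldsymbol{\beta}_K^{p})^\top)^\top \in \mathbb{R}^{K(q_d+1)}$. For the
canonical parameter $\boldsymbol{\theta}$, the mean parameter $\boldsymbol{\mu}$, respectively, and the dispersion
parameter $\boldsymbol{\varphi}$ we proceed analogously. Choose strictly monotone and smooth link
functions $g_\mu$ and $g_\varphi$, and consider the double GLMs, for $1 \le k \le K$, on the learned
representations $\boldsymbol{z}_i$

$$g_\mu\left(\mu_k(\boldsymbol{z}_i)\right) = \langle\boldsymbol{\beta}_k^{\mu}, \boldsymbol{z}_i\rangle \qquad \text{and} \qquad g_\varphi\left(\varphi_k(\boldsymbol{z}_i)\right) = \langle\boldsymbol{\beta}_k^{\varphi}, \boldsymbol{z}_i\rangle, \tag{11.48}$$

with regression parameters $\boldsymbol{\beta}^{\mu} = ((\boldsymbol{\beta}_1^{\mu})^\top, \ldots, (\boldsymbol{\beta}_K^{\mu})^\top)^\top \in \mathbb{R}^{K(q_d+1)}$ for the
mean parameters and $\boldsymbol{\beta}^{\varphi} = ((\boldsymbol{\beta}_1^{\varphi})^\top, \ldots, (\boldsymbol{\beta}_K^{\varphi})^\top)^\top \in \mathbb{R}^{K(q_d+1)}$ for the dispersion
parameters. Thus, altogether this gives us a network parameter of dimension, set
$q_0 = q$,

$$r = \sum_{m=1}^d q_m(q_{m-1} + 1) + 3K(q_d + 1).$$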

Remarks 11.14 


• The regression functions (11.47)-(11.48) use a slight abuse of notation, because,
strictly speaking, these should be functions w.r.t. the features $\boldsymbol{x}_i \in \mathcal{X}$, i.e.,
we should understand the learned representations $\boldsymbol{z}_i$ as a short form for $\boldsymbol{x}_i \mapsto
\boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)$.

• It is not fully correct to say that (11.47) is the logistic categorical GLM
of formula (5.72), because (11.47) does not lead to identifiable regression
parameters. In fact, we should reduce the dimension of the categorical GLM to
$K - 1$, by setting $\boldsymbol{\beta}_K^{p} = 0$, see (5.70), because the probability of the last label
$K$ is fully determined if we know the probabilities of all other labels; this would
also justify saying that $\boldsymbol{h}$ is the canonical link. Since in FN network modeling we
do not have identifiability anyway, we neglect this normalization (redundancy),
see line 16 of Listing 11.12, below.

• The above proposal (11.47)-(11.48) suggests to use the same network $\boldsymbol{z}^{(d:1)}$
for all mixture parameters involved. This requires that the chosen network is
sufficiently large, so that it can comply simultaneously with these different tasks.
Alternatively, we could choose three separate (parallel) networks for $\boldsymbol{p}$, $\boldsymbol{\mu}$ and
$\boldsymbol{\varphi}$, respectively. This second proposal does not (easily) allow for (non-trivial)
interactions between the parameters, and it may also suffer from less robustness
in fitting.

• Proposal (11.48) defines double GLMs for the mixture components $f_k$, $1 \le k \le
K$. If we decide to not model the dispersion parameters feature dependent, i.e., if
we set $\varphi_k(\boldsymbol{z}) \equiv \varphi_k \in \mathbb{R}_+$, then the mixture components are modeled with GLMs
on the learned representations $\boldsymbol{z}_i = \boldsymbol{z}^{(d:1)}(\boldsymbol{x}_i)$. Nevertheless, this latter approach
still requires that the dispersion parameters $\varphi_k$ are set to reasonable values, as
they enter the score equations; this can be seen from (6.29) adapted to MDNs.
Thus, in MDNs, the dispersion parameters do not cancel in the score equations,
which is different from the single distribution case. The dispersion parameter can
either be estimated (updated) during the M-step of the EM algorithm (supposed
we use the EM algorithm), or it can be pre-specified as a given hyper-parameter.

• As mentioned in Sect. 6.3, mixture density fitting can be challenging because,
in general, mixture density log-likelihoods are unbounded. Therefore, a suitable
initialization of the EM algorithm is important for a successful model fitting.
This problem is less pronounced in MDNs as we use early stopping in SGD
fitting that prevents the fitted parameters from depending on a small set of observations.
For instance, Example 6.13 cannot occur because an individual observation $Y_i$
enters at most one (mini-)batch of SGD, and the SGD algorithm will provide
a good balance across all batches. Moreover, early stopping will imply that the
selected parameters must also be good on the validation data being disjoint (and
independent) from the training data.

• Delong et al. [95] present two different ways of fitting such MDNs. The crucial
property in EM fitting is to preserve the monotonicity in the M-step. For MDNs
this can either be achieved by using the parameters as offsets for the next EM
iteration (this is called 'EM network boosting' in Delong et al. [95]) or by forwarding
the network weights from one loop to the next (called 'EM forward network'
in Delong et al. [95]). We are going to present the second option in the next
example.


Example 11.15 (Gamma Claim Size Modeling and MDNs) We revisit Example 6.14
which models the claim sizes of the French MTPL data. For the modeling
of these claim sizes we choose the mixture distribution (6.39) which has four
gamma components $f_1,\ldots,f_4$ and one Lomax component $f_5$. In a first step we
again model these five mixture components independent of the feature information
$\boldsymbol{x}$, and the feature information only enters the mixture probabilities $\boldsymbol{p}(\boldsymbol{x}) \in \Delta_5$.
This modeling approach has been motivated by Fig. 13.17 which suggests that
the features mainly result in systematic effects on the mixture probabilities. We
choose the same model and feature information as in Example 6.14. We only
replace the logistic categorical GLM part (6.40) for modeling $\boldsymbol{p}(\boldsymbol{x})$ by a depth
$d = 2$ FN network with $(q_1, q_2) = (20, 10)$ neurons. Area, VehAge, DrivAge
and BonusMalus are modeled as continuous variables, and for the categorical
variables VehBrand and Region we choose two-dimensional embedding layers.


Listing 11.12 R code of the MDN for modeling the mixture probability $\boldsymbol{p}(\boldsymbol{x})$

Design   = layer_input(shape = c(4), dtype = 'float32')
VehBrand = layer_input(shape = c(1), dtype = 'int32')
Region   = layer_input(shape = c(1), dtype = 'int32')
Bias     = layer_input(shape = c(1), dtype = 'float32')
# two-dimensional embeddings of the categorical features
BrandEmb = VehBrand %>%
    layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
    layer_flatten()
RegionEmb = Region %>%
    layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
    layer_flatten()
# mixture probabilities
pp = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
    layer_dense(units=20, activation='tanh') %>%
    layer_dense(units=10, activation='tanh') %>%
    layer_dense(units=5, activation='softmax')
# means of the gamma components
mu = Bias %>% layer_dense(units=4, activation='exponential',
    use_bias=FALSE)
# tail parameter of the Lomax component
tail = Bias %>% layer_dense(units=1, activation='sigmoid',
    use_bias=FALSE)
# shape parameters of the gamma components
shape = Bias %>% layer_dense(units=4, activation='exponential',
    use_bias=FALSE)
#
Response = list(pp, mu, tail, shape) %>% layer_concatenate()
#
keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))


Listing 11.12 shows the chosen network. Lines 13-16 model the mixture probability
$\boldsymbol{p}(\boldsymbol{x})$. We also integrate the modeling of the (homogeneous) parameters of the
mixture densities $f_1,\ldots,f_5$. Lines 18 and 24 of Listing 11.12 consider the mean
and shape parameter of the gamma components, and line 21 the tail parameter $1/\beta_5$
of the Lomax component. Note that we use the sigmoid activation for this Lomax
parameter. This implies $1/\beta_5 \in (0,1)$ and, thus, $\beta_5 > 1$, which enforces a finite
mean model. The exponential activations on lines 18 and 24 ensure positivity of
these parameters. The input Bias to these variables is simply the constant 1, which
is the homogeneous case not differentiating w.r.t. the features.
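The size of this network can be verified by counting the weights layer by layer. The following base R sketch does this for the layer dimensions of Listing 11.12; the helper `dense` is a hypothetical name introduced only for this count.

```r
# Sketch: number of parameters of the network in Listing 11.12.
# dense() counts the weights (and optionally the biases) of one dense layer.
dense <- function(n_in, n_out, bias = TRUE) n_out * (n_in + bias)

emb   <- 11 * 2 + 22 * 2                 # two embedding layers of dimension 2
q0    <- 4 + 2 + 2                       # continuous inputs plus both embeddings
probs <- dense(q0, 20) + dense(20, 10) + dense(10, 5)                  # mixture probabilities
homog <- dense(1, 4, FALSE) + dense(1, 1, FALSE) + dense(1, 4, FALSE)  # mu, tail, shape
total <- emb + probs + homog
```

The resulting `total` of 520 agrees with the number of parameters reported for the logistic FN network in Table 11.12.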

Observe that in most of the networks so far, the output of the network was
equal to an expected response of a random variable that we try to predict. In
this MDN we output the parameters of a distribution function, see line 27 of
Listing 11.12. In our case this output has dimension 14, which then enters the score
in Listing 11.13. In a first attempt we fit this MDN brute-force by just implementing
the incomplete log-likelihood received from (6.39). Since the gamma function
$\Gamma(\cdot)$ is not easily available in keras [77], we replace the gamma density by its
saddlepoint approximation, see Sect. 5.5.2. Listing 11.13 shows the negative
log-likelihood of the mixture density that is used to perform the brute-force SGD fitting.
Listing 11.13 Mixture density negative incomplete log-likelihood

mixture.LogLikeli <- function(true, pred){ - k_mean(k_log(
    pred[,1]*k_exp(-k_log(2*pi*true[,1]^2/pred[,11])/2 -
        pred[,11]*(true[,1]/pred[,6]-1+k_log(pred[,6]/true[,1]))) +
    pred[,2]*k_exp(-k_log(2*pi*true[,1]^2/pred[,12])/2 -
        pred[,12]*(true[,1]/pred[,7]-1+k_log(pred[,7]/true[,1]))) +
    pred[,3]*k_exp(-k_log(2*pi*true[,1]^2/pred[,13])/2 -
        pred[,13]*(true[,1]/pred[,8]-1+k_log(pred[,8]/true[,1]))) +
    pred[,4]*k_exp(-k_log(2*pi*true[,1]^2/pred[,14])/2 -
        pred[,14]*(true[,1]/pred[,9]-1+k_log(pred[,9]/true[,1]))) +
    pred[,5]*k_exp(k_log(1/(pred[,10]*M)) - (1/pred[,10]+1)
        *k_log(true[,1]/M+1))))
}


Lines 2-9 give the saddlepoint approximations to the four gamma components, and 
line 10 the Lomax component for the scale parameter M. Note that this brute-force 
approach is based only on the incomplete observation Y encoded in true[,1], 
see Listing 11.13. 
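The two density formulas used in Listing 11.13 can be checked in base R, where the exact gamma density is available through `dgamma()`. The sketch below rewrites the saddlepoint approximation and the Lomax density as plain R functions; the function names `gamma_saddle` and `lomax` as well as the parameter values are hypothetical choices for illustration, not fitted values.

```r
# Sketch: saddlepoint approximation of the gamma density with mean mu and shape a,
# written as in Listing 11.13, compared with the exact density from dgamma().
gamma_saddle <- function(y, mu, a) {
  exp(-log(2 * pi * y^2 / a) / 2 - a * (y / mu - 1 + log(mu / y)))
}
lomax <- function(y, tail, M) {           # tail = 1/beta_5, scale parameter M
  exp(log(1 / (tail * M)) - (1 / tail + 1) * log(y / M + 1))
}

y  <- c(100, 500, 1000, 5000)
mu <- 1200; a <- 5                         # illustrative values, not fitted parameters
ratio <- gamma_saddle(y, mu, a) / dgamma(y, shape = a, rate = a / mu)

# the ratio does not depend on y: it equals Gamma(a) divided by its Stirling approximation
stirling <- exp(lgamma(a) + a - (a - 0.5) * log(a) - 0.5 * log(2 * pi))
```

Since the ratio is constant in `y` and in `mu`, the saddlepoint approximation affects the mean parameter and the mixture probabilities only through a factor that depends on the shape parameter, and this factor approaches one for large shape parameters.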

We fit this logistic categorical FN network of Listing 11.12 under the score function
of Listing 11.13 using the nadam version of SGD. Moreover, we use a stratified
training-validation split, otherwise we did not obtain a competitive model. The
results are presented in Table 11.12 on line 'logistic FN network: brute-force fitting'.
We observe a slightly worse performance (in-sample) than in the logistic GLM. This
does not justify the use of the more complex network architecture. Or in other words,
feature pre-processing seems to have been done suitably in Example 6.14.

In a next step, we fit this MDN with the (generalized) EM algorithm. The E-step
is exactly the same as in Example 6.14. For the M-step, having knowledge of
the (latent mixture component) variables $\boldsymbol{Z}_i$, $1 \le i \le n$, implies that the mixture
probability estimation and the mixture density estimation completely decouple. As
a consequence, the parameters of the density components $f_1,\ldots,f_5$ can directly
be estimated using univariate MLEs; this is the same as in Example 6.14. The
only part that needs further explanation is the estimation of the logistic categorical
FN network for $\boldsymbol{p}(\boldsymbol{x})$. In each loop of the EM iteration we would like to find the
optimal network parameter for $\boldsymbol{p}(\boldsymbol{x})$, and at the same time we have to ensure the
monotonicity (6.38). Following the 'EM forward network' approach of Delong et
al. [95], this is most easily achieved by just initializing the FN network in loop $t$ of
the algorithm with the optimal network parameter of the previous loop $t-1$. Thus,
the starting parameter of SGD reflects the optimal parameter from the previous
step, and since SGD generally decreases losses, the monotonicity (6.38) holds. The
latter statement is not strictly true: SGD introduces additional randomness through
the building of (mini-)batches, therefore, monotonicity should be traced explicitly
(which also ensures that the early stopping rule is chosen suitably). We have
implemented such an EM-SGD algorithm; essentially, we just have to drop lines
17-28 of Listing 11.12 and lines 13-16 provide the entire response. As loss function
we choose the categorical (multi-class) cross-entropy loss, see (4.19). The results in
Table 11.12 on line 'logistic FN network: EM fitting' indicate a superior fitting
behavior compared to the brute-force fitting. Nevertheless, this network approach
is still not outperforming the GLM approach, saying that we should stay with the
simpler GLM.

Table 11.12 Mixture models for French MTPL claim size modeling; we set M = 2'000

                                          | # Param. | $\ell_{\boldsymbol{Y}}(\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{p}})$ | $\hat{\mu} = \mathbb{E}_{\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{p}}}[Y]$
Empirical                                 |          |           | 2'266
Null model                                | 13       | -199'306  | 2'381
Logistic GLM, Example 6.14                | 193      | -198'404  | 2'176
Logistic FN network: brute-force fitting  | 520      | -198'623  | 2'003
Logistic FN network: EM fitting           | 520      | -198'449  | 2'119
MDN: brute-force fitting                  | 825      | -198'178  | 2'144
MDN: EM fitting                           | 825      | -198'085  | 2'240

In a final step, we also model the mean parameters $\mu_k(\boldsymbol{x})$, $1 \le k \le 4$, of the
gamma components feature dependent, to see whether we can gain predictive power
from this additional flexibility or whether our initial model choice is sufficient. For
robustness reasons we neither model the shape parameters $\alpha_k$, $1 \le k \le 4$, of
the gamma components feature dependent nor the tail parameter $\beta_5$ of the Lomax
component. The implementation only requires small changes to Listing 11.12, see
Listing 11.14.

A brute-force fitting of the MDN architecture of Listing 11.14 can directly be based
on the score function (negative incomplete log-likelihood) of Listing 11.13. In the
case of the EM algorithm we need to change the score function to the complete
log-likelihood accounting for the variables $\hat{\boldsymbol{Z}}_i \in \Delta_5$. This is done in Listing 11.15
where $\hat{\boldsymbol{Z}}_i$ is encoded in the variables true[,2] to true[,6].

We fit this MDN using the two different fitting approaches, and the results are given 
on the last two lines of Table 11.12. Again the performance of the EM fitting is 
slightly better than the brute-force fitting, and the bigger log-likelihoods indicate 
that we can gain predictive power by also modeling the means of the gamma 
components feature dependent. 

Figure 11.21 compares the QQ plot of the resulting MDN with EM fitting to the
one obtained from the logistic categorical GLM of Example 6.14. These graphs are
very similar. We conclude that in this particular example the simpler
proposal of Example 6.14 seems sufficient. ■


Listing 11.14 R code of the MDN for modeling the mixture probability $p(x)$ and the gamma
means $\mu_k(x)$

# inputs: continuous features, two categorical features, and a constant bias term
Design   = layer_input(shape = c(4), dtype = 'float32')
VehBrand = layer_input(shape = c(1), dtype = 'int32')
Region   = layer_input(shape = c(1), dtype = 'int32')
Bias     = layer_input(shape = c(1), dtype = 'float32')
#
# two-dimensional embeddings of the categorical features
BrandEmb = VehBrand %>%
    layer_embedding(input_dim = 11, output_dim = 2, input_length = 1) %>%
    layer_flatten()
RegionEmb = Region %>%
    layer_embedding(input_dim = 22, output_dim = 2, input_length = 1) %>%
    layer_flatten()
#
# FN network of depth 3 on the concatenated features
Network = list(Design, BrandEmb, RegionEmb) %>% layer_concatenate() %>%
    layer_dense(units=20, activation='tanh') %>%
    layer_dense(units=15, activation='tanh') %>%
    layer_dense(units=10, activation='tanh')
#
# feature dependent mixture probabilities and gamma means
pp    = Network %>% layer_dense(units=5, activation='softmax')
mu    = Network %>% layer_dense(units=4, activation='exponential',
                                use_bias=FALSE)
# feature independent Lomax tail parameter and gamma shape parameters
tail  = Bias %>% layer_dense(units=1, activation='sigmoid',
                             use_bias=FALSE)
shape = Bias %>% layer_dense(units=4, activation='exponential',
                             use_bias=FALSE)
#
Response = list(pp, mu, tail, shape) %>% layer_concatenate()
keras_model(inputs = c(Design, VehBrand, Region, Bias), outputs = c(Response))


Listing 11.15 Mixture density negative complete log-likelihood

# true[,1]: claim size; true[,2] to true[,6]: encoding of Z_i
# pred[,1:5]: mixture probabilities; pred[,6:9]: gamma means;
# pred[,10]: Lomax tail; pred[,11:14]: gamma shapes; M: Lomax scale (global constant)
mixture_LogLikeli_Complete <- function(true, pred){ - k_mean(
    true[,2]*(k_log(pred[,1])-k_log(2*pi*true[,1]^2/pred[,11])/2 -
        pred[,11]*(true[,1]/pred[,6]-1+k_log(pred[,6]/true[,1]))) +
    true[,3]*(k_log(pred[,2])-k_log(2*pi*true[,1]^2/pred[,12])/2 -
        pred[,12]*(true[,1]/pred[,7]-1+k_log(pred[,7]/true[,1]))) +
    true[,4]*(k_log(pred[,3])-k_log(2*pi*true[,1]^2/pred[,13])/2 -
        pred[,13]*(true[,1]/pred[,8]-1+k_log(pred[,8]/true[,1]))) +
    true[,5]*(k_log(pred[,4])-k_log(2*pi*true[,1]^2/pred[,14])/2 -
        pred[,14]*(true[,1]/pred[,9]-1+k_log(pred[,9]/true[,1]))) +
    true[,6]*(k_log(pred[,5])+k_log(1/(pred[,10]*M)) -
        (1/pred[,10]+1)*k_log(true[,1]/M+1)))
}


In a next step, we try to understand which feature components influence the mixture
probabilities $p(x) = (p_1(x), \ldots, p_K(x))^\top$ most. Similarly to Examples 6.14
and 11.15, we therefore use a MDN where we only fit the mixture probability
$p(x)$ with a network and the mixture components $f_1, \ldots, f_K$ are assumed to be
homogeneous.


Example 11.16 (MDN with LocalGLMnet) We revisit Example 11.15. We choose
the mixture distribution (6.39) which has four gamma components $f_1, \ldots, f_4$ and
a Lomax component $f_5$. We select their parameters independent of the features.
The feature information $x$ should only enter the mixture probability $p(x) \in \Delta_K$,
similarly to the first part of Example 11.15. We replace the logistic FN network of
Example 11.15 for modeling $p(x)$ by a LocalGLMnet such that we can analyze the
importance of the variables, see Sect. 11.5.

For the feature information we choose the continuous variables Area,
VehPower, VehAge, DrivAge and BonusMalus, the binary variable VehGas
and the categorical variables VehBrand and Region; thus, we extend by
VehPower and VehGas compared to Example 11.15. These latter two variables
have not been included previously, because they did not seem to be important
w.r.t. Fig. 13.17. The continuous and binary variables are centered and normalized
to unit variance. For the categorical variables we use two-dimensional embedding
layers, and afterwards they are concatenated with the continuous variables with
a subsequent normalization layer (to ensure that all components live on the same
scale). This provides us with a 10-dimensional feature vector. This feature vector
is complemented with an i.i.d. standard Gaussian component, called Random,
to perform an empirical Wald type test. We call this pre-processed feature (after
embedding and normalization of the categorical variables) $x \in \mathbb{R}^{q_0}$ with $q_0 = 11$.

Fig. 11.21 QQ plots of mixture models: (lhs) logistic categorical GLM for mixture probabilities
and (rhs) for MDN with EM fitting

We design a LocalGLMnet that acts on this feature $x \in \mathbb{R}^{q_0}$ for modeling
a categorical multi-class output with $K = 5$ levels. Therefore, we choose the
regression attentions

$$z^{(d:1)}: \mathbb{R}^{q_0} \to \mathbb{R}^{q_0 \times K}, \qquad x \mapsto \beta(x) = (\beta_1(x), \ldots, \beta_K(x)) = z^{(d:1)}(x),$$

where $z^{(d:1)}$ is a network of depth $d$ having a matrix-valued output of dimension
$q_0 \times K$. For the (canonical) link $h$, this gives us the predictor, see (5.72),

$$h(p(x)) = \left(\beta_{1,0} + \langle \beta_1(x), x\rangle, \ldots, \beta_{K,0} + \langle \beta_K(x), x\rangle\right)^\top \in \mathbb{R}^K, \tag{11.49}$$

with intercepts $\beta_{k,0} \in \mathbb{R}$, and where $\beta_k(x) \in \mathbb{R}^{q_0}$ is the $k$-th column of regression
attention $\beta(x) = z^{(d:1)}(x) \in \mathbb{R}^{q_0 \times K}$. We also refer to the second item of
Remarks 11.14 concerning a possible dimension reduction in (11.49), i.e., in fact we
apply the softmax activation function to the right-hand side of (11.49), neglecting
the identifiability issue. Moreover, as in the introduction of the LocalGLMnet, we
separate the intercept components from the remaining features in (11.49).
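The map from regression attentions to class probabilities in (11.49) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper name `localglm_probs` and the arguments `beta0` and `beta` are hypothetical, and the attention matrix is filled with random numbers, whereas in the model it would be the network output $z^{(d:1)}(x)$.

```r
# Hypothetical helper: class probabilities from regression attentions, see (11.49).
# x:     feature vector of length q0
# beta0: intercepts of length K
# beta:  q0 x K matrix of regression attentions evaluated at x (assumed given)
localglm_probs <- function(x, beta0, beta) {
  eta <- beta0 + as.vector(t(beta) %*% x)  # beta_{k,0} + <beta_k(x), x>, k = 1..K
  eta <- eta - max(eta)                    # shift for numerical stability
  exp(eta) / sum(exp(eta))                 # softmax over the K classes
}

set.seed(1)
q0 <- 11; K <- 5
x     <- rnorm(q0)
beta0 <- rnorm(K)
beta  <- matrix(rnorm(q0 * K), nrow = q0, ncol = K)  # stand-in for z^(d:1)(x)
p <- localglm_probs(x, beta0, beta)
```

Subtracting the same constant from every component of `eta` leaves the output unchanged, which is one way to see the identifiability issue mentioned above.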

We fit this LocalGLMnet-MDN with the EM version presented in Example 11.15.
We apply early stopping based on the same stratified training-validation
split as in the aforementioned example, and this provides us with a log-likelihood
of -198’290, thus, slightly bigger than the corresponding numbers in Table 11.12.
More interestingly, our goal is to understand the regression attentions given by
$\beta(x_i) = (\beta_1(x_i), \ldots, \beta_5(x_i)) \in \mathbb{R}^{11 \times 5}$ over all claims $1 \le i \le n$. Figure 11.22
shows the resulting boxplots, where each of the five graphs corresponds to one
mixture component $1 \le k \le 5$, and the different colors illustrate the 11 feature
components providing the attention weights $\beta_{k,j}(x_i)$, $1 \le j \le 11$. The red boxplots
show the purely random component Random for $1 \le k \le 5$, which provides
the acceptance region of an empirical Wald test for the null hypothesis that the
corresponding term should be dropped. This is highlighted by the orange shaded
area (at a significance level of 0.1%). Thus, whenever a boxplot lies within this
orange shaded area we may consider dropping this term, e.g., for $k = 2$ (top-right),
this is the case for Area, VehPower and Region2 (being the second component
of the two-dimensional region embedding). Note that this interpretation needs some
care because we do not have identifiability in the class probabilities.

The first observation is that, indeed, VehPower is mostly in the orange
confidence area and, thus, may be dropped. This does not apply to the other feature
components, and, thus, we should keep them in the model. The three gamma mixture
components $f_1$, $f_2$ and $f_3$ correspond to the three modes at 75, 600 and 1’175
in Fig. 13.17. Component $f_4$ is a gamma component covering the whole range
of claims, and $f_5$ is the Lomax component modeling the regular variation in the
tail. Interestingly, DrivAge and BonusMalus seem very important for mixture
components $k = 1$, $k = 3$ and $k = 4$ (with different signs); this is supported
by Fig. 13.17. The Lomax component seems mostly impacted by DrivAge,
VehBrand and Region. Only mixture component $k = 2$ is more difficult to
interpret. This component seems influenced by most of the feature components, in
particular, the combination of VehAge, VehGas and VehBrand seems important.
This could mean that mixture component $k = 2$ belongs to a certain type of vehicle.

In a next step we could study interactions and their impact on the mixture
components, and LASSO regularization would provide us with another method of
variable selection, see Sect. 11.5.4. We refrain from doing so and close the example.

Fig. 11.22 Boxplot of regression attentions $\beta(x_i) = (\beta_1(x_i), \ldots, \beta_5(x_i)) \in \mathbb{R}^{11 \times 5}$ over all ...
[five panels titled "importance measure for mixture component k", k = 1, ..., 5; horizontal axis: the 11 feature components, including both embedding dimensions of VehBrand and Region, and Random]


11.6.2 Estimation of Conditional Expectations


FN networks have also found their way into solving risk management problems.
We briefly introduce a valuation problem and then describe a way of solving
this problem. Assume we have a liability cash flow $Y_{1:T} = (Y_1, \ldots, Y_T)$ with
(random) payments $Y_t$ at time points $t = 1, \ldots, T$. We assume that this liability
cash flow $Y_{1:T}$ is adapted to a filtration $(\mathcal{A}_t)_{1 \le t \le T}$ on the underlying probability
space $(\Omega, \mathcal{A}, \mathbb{P})$. Moreover, we assume to have a pricing kernel (state price deflator)
$\varphi_{1:T} = (\varphi_1, \ldots, \varphi_T)$ on that probability space which is an $(\mathcal{A}_t)_{1 \le t \le T}$-adapted
random vector with strictly positive components $\varphi_t > 0$, a.s., for all $1 \le t \le T$. A
no-arbitrage value of the outstanding liability cash flow at time $1 \le t < T$ can be
defined by (we assume existence of all second moments)

$$R_t = \sum_{s=t+1}^{T} \frac{1}{\varphi_t}\, \mathbb{E}\left[\varphi_s Y_s \,\middle|\, \mathcal{A}_t\right]. \tag{11.50}$$

For the mathematical background on no-arbitrage pricing using state price deflators
we refer to Wüthrich–Merz [393]. The $\mathcal{A}_t$-measurable quantity $R_t$ is called
reserves of the outstanding liabilities at time $t$. From a risk management and
solvency point of view we would like to understand the volatility in the reserves
$R_t$ seen from time 0, i.e., we try to model the random variable $R_t$ seen from time
0 (based on the trivial $\sigma$-algebra $\mathcal{A}_0 = \{\emptyset, \Omega\}$). In applied problems, the difficulty
often is that the conditional expectations under the summation in (11.50) cannot be
computed in closed form. Therefore the law of $R_t$ cannot be determined explicitly.

We provide a numerical solution to the calculation of the conditional expectations
in (11.50). Assume that the information set $\mathcal{A}_t$ can be described by a random vector
$X_t$, i.e., $\mathcal{A}_t = \sigma(X_t)$. In that case we rewrite (11.50) as follows

$$R_t = \sum_{s=t+1}^{T} \frac{1}{\varphi_t}\, \mathbb{E}\left[\varphi_s Y_s \,\middle|\, X_t\right]. \tag{11.51}$$

The latter now indicates that we can determine the conditional expectations
in (11.51) as regression functions in features $X_t$, and we try to understand for $s > t$

$$x_t \mapsto \mathbb{E}\left[\frac{\varphi_s}{\varphi_t}\, Y_s \,\middle|\, X_t = x_t\right]. \tag{11.52}$$

The random variable $R_t$ can then be determined empirically by simulation. This
requires two steps: (1) We have to be able to simulate $\varphi_s Y_s / \varphi_t$, conditionally given
$X_t = x_t$. This allows us to estimate the conditional expectation (11.52) with a
regression function. (2) We need to be able to simulate $X_t$. This provides us with
the empirical occurrence probabilities of specific choices $X_t = x_t$ in (11.52) which
then gives an empirical version of $R_t$.

In theory, this problem can be approached by nested simulations, which is
a two-stage procedure that first performs step (2), and then calculates step (1)
empirically with Monte Carlo simulations for every realization of step (2), see,
e.g., Lee [242] and Glynn–Lee [161]. The disadvantage of this two-stage nested
simulation procedure is that it is computationally demanding. Building upon the
work on valuation of American options by Carriere [65], Tsitsiklis–Van Roy [356]
and Longstaff–Schwartz [257], the papers of Broadie et al. [55] and Ha–Bauer [177]
propose to regress future cash flows on finitely many basis functions depending on
the state variable $X_t$. More recently, machine learning tools such as FN networks
have been proposed to determine these basis and regression functions, see, e.g.,
Cheridito et al. [74] or Krah et al. [224].

In the following, we assume that all random variables considered are square-integrable
and, thus, we can work in a Hilbert space with the scalar product
$\langle X, Z\rangle = \mathbb{E}[XZ]$ for $X, Z \in L^2(\Omega, \mathcal{A}, \mathbb{P})$. Moreover, for simplicity, we drop the
time indices and we also drop the stochastic discounting in (11.52) by assuming
$\varphi_s/\varphi_t = 1$. These simplifications are not essential technically; they only simplify our
outline. The conditional expectation $\mu(X) = \mathbb{E}[Y|X]$ can then be found by the
orthogonal projection of $Y$ onto the sub-space $\sigma(X)$, generated by $X$, in the Hilbert
space $L^2(\Omega, \mathcal{A}, \mathbb{P})$. That is, the conditional expectation is the measurable function
$\mu: \mathbb{R}^{q_0} \to \mathbb{R}$, $X \mapsto \mu(X)$, that minimizes the mean squared error

$$\mathbb{E}\left[(Y - \mu(X))^2\right] \overset{!}{=} \min, \tag{11.53}$$

among all measurable functions on $X$. In Example 3.7, we have seen that $\mu(\cdot)$ is the
minimizer of this problem if and only if

$$\mu(x) = \underset{m \in \mathbb{R}}{\arg\min} \int_{\mathbb{R}} (y - m)^2 \, dF_{Y|x}(y), \tag{11.54}$$

for $p_X$-a.e. $x \in \mathbb{R}^{q_0}$, where $p_X$ is the distribution of $X$, and where $F_{Y|x}$ is the
conditional distribution of $Y$, given feature $X = x$; we also refer to (3.6).

Under the assumption that we can simulate observations $(Y, X)$ under $\mathbb{P}$, we can
solve (11.53)-(11.54) approximately by restricting to a sufficiently rich family of
regression functions. Choose a FN network $z^{(d:1)}: \mathbb{R}^{q_0} \to \mathbb{R}^{q_d}$ of depth $d$ and the
identity link $g(x) = x$. An optimal network parameter $\widehat{\vartheta}$ is found by minimizing

$$\widehat{\vartheta} = \underset{\vartheta \in \mathbb{R}^r}{\arg\min}\ \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \left\langle \beta, z^{(d:1)}(X_i)\right\rangle\right)^2, \tag{11.55}$$

where $(Y_i, X_i)$, $1 \le i \le n$, are i.i.d. copies of $(Y, X)$. This provides us with the
fitted FN network $\widehat{z}^{(d:1)}(\cdot)$ and the fitted output parameter $\widehat{\beta}$. These can be used to
obtain an approximation to the conditional expectation, solution of (11.54),

$$x \mapsto \widehat{\mu}(x) = \left\langle \widehat{\beta}, \widehat{z}^{(d:1)}(x)\right\rangle \approx \mu(x) = \mathbb{E}[Y | X = x]. \tag{11.56}$$

This then allows us to approximate the random variable in (11.51) empirically by
simulating features $X$ and inserting them into the left-hand side of (11.56).
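The two steps, fitting a regression function by least squares and then inserting newly simulated features, can be illustrated in base R. The sketch below is not the network fit of the text: it swaps the FN network for a quadratic polynomial basis fitted with `lm`, and it uses a toy model (an assumption made purely for illustration) in which the true conditional expectation is known to be $\mu(x) = x^2$.

```r
# Toy model (assumption): X ~ N(0,1), Y = X^2 + noise, so that E[Y | X = x] = x^2.
set.seed(100)
n <- 100000
X <- rnorm(n)
Y <- X^2 + rnorm(n)

# Least squares fit over a small family of regression functions
# (quadratic polynomials in place of the FN network of the text), cf. (11.55).
fit <- lm(Y ~ X + I(X^2))
mu_hat <- function(x) unname(predict(fit, newdata = data.frame(X = x)))

# Simulate new features and insert them into the fitted regression function, cf. (11.56);
# this gives an empirical sample of the random variable mu(X).
X_new <- rnorm(50000)
mu_sample <- mu_hat(X_new)
```

The empirical distribution of `mu_sample` then plays the role of the law of the conditional expectation, which is the object of interest in (11.51).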


Remarks 11.17


• There are different types of errors involved. First, there is an irreducible
approximation error if the chosen family of FN networks is not sufficiently
rich to approximate the conditional expectation well. For example, if we choose
the hyperbolic tangent activation function, then, naturally, $z^{(d:1)}(\cdot)$ is uniformly
bounded for a fixed network parameter $\vartheta$. This does not necessarily apply to
the conditional expectation $\mathbb{E}[Y | X = \cdot\,]$ and, thus, the approximation in the tail
may be poor. Second, we consider an approximation based on a finite sample
in (11.55). However, this error can be made arbitrarily small by letting $n \to \infty$.
In-sample over-fitting should not be an issue as we may generate samples of
arbitrarily large sample sizes. Third, having the approximation (11.56), we still
need to simulate i.i.d. samples $X_k$, $k \ge 1$, having the same distribution as $X$ to
empirically approximate the distribution of the random variable $R_t$ in (11.51).
Also in this step we benefit from the fact that we can simulate infinitely many
samples to mitigate this approximation error.

• To fit the network parameter $\vartheta$ in (11.55) we use i.i.d. copies $(Y_i, X_i)$, $1 \le i \le n$,
that have the same distribution as $(Y, X)$ under $\mathbb{P}$. However, to obtain a good
approximation to the regression function $x \mapsto \mu(x)$ we only need to simulate
$Y_i|_{\{X_i = x_i\}}$ from $F_{Y|x_i}(\cdot) = \mathbb{P}[\,\cdot\,| X_i = x_i]$, and $X_i$ can be simulated from an
arbitrary distribution equivalent to $p_X$, and we still get the right conditional
expectation in (11.54). This is worth mentioning because if we need a higher
precision in some part of the feature space of $X$, we can apply a sort of
importance sampling by choosing a distribution for $X$ that generates more
samples in the corresponding part of the feature space compared to the original
(true) distribution $p_X$ of $X$; this proposal has been emphasized in Cheridito et
al. [74].
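The second remark can be checked on a toy example in base R. The model, the proposal distribution N(2, 1) and the quadratic regression family below are all assumptions chosen for illustration; the point is only that the features come from a different (equivalent) distribution while the responses still follow the correct conditional law. The recovery works here because the regression family contains the true function.

```r
set.seed(200)
n <- 100000
# Features drawn from a proposal N(2, 1) instead of the "true" N(0, 1);
# both distributions are equivalent on the real line.
X_prop <- rnorm(n, mean = 2, sd = 1)
# The response is still simulated from the correct conditional law given X,
# here Y | X = x ~ N(x^2, 1), i.e. mu(x) = x^2.
Y <- X_prop^2 + rnorm(n)

fit_prop <- lm(Y ~ X_prop + I(X_prop^2))
coef_prop <- unname(coef(fit_prop))   # should be close to (0, 0, 1)
```

Compared to sampling from N(0, 1), this proposal places far more observations around $x = 2$ and beyond, so the fit is more precise in that part of the feature space.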


We study the example presented in Ha—Bauer [177] and Cheridito et al. [74]. 
This example considers a variable annuity (VA) with a guaranteed minimum income 
benefit (GMIB), and we revisit the network approach of Cheridito et al. [74]. 


Example 11.18 (Approximation of Conditional Expectations) We consider the VA
example with a GMIB introduced and studied in Ha–Bauer [177]. This example
involves a 3-dimensional stochastic process, for $t \ge 0$,

$$X_t = (q_t, r_t, m_{x+t}),$$

with $q_t$ being the log-value of the VA account at time $t$, $r_t$ the short rate at time $t$,
and $m_{x+t}$ the force of mortality at time $t$ of a person aged $x$ at time 0. The payoff
at fixed maturity date $T > 1$ of this insurance contract is given by

$$S = S(X_T) = \max\left\{e^{q_T},\ b\, a_{x+T}(r_T, m_{x+T})\right\},$$

where $e^{q_T}$ is the VA account value at time $T$, and $b\, a_{x+T}(r_T, m_{x+T})$ is the GMIB at
time $T$ consisting of a face value $b > 0$ and with $a_{x+T}(r_T, m_{x+T})$ being the value
of an immediate annuity at time $T$ of a person aged $x + T$. Our goal is to model the
conditional expectation

$$\mu(X_t) = D(t, T; X_t)\, \mathbb{E}\left[S(X_T) \,\middle|\, X_t\right] = D(t, T; X_t)\, \mathbb{E}\left[\max\left\{e^{q_T},\ b\, a_{x+T}(r_T, m_{x+T})\right\} \,\middle|\, X_t\right], \tag{11.57}$$

for a fixed valuation time point $0 < t < T$, and where $D(t, T) = D(t, T; X_t)$
is a $\sigma(X_t)$-measurable discount factor. This requires the explicit specification of
the GMIB term as a function of $(r_T, m_{x+T})$, the modeling of the stochastic process
$(X_t)_{0 \le t \le T}$, and the specification of the discount factor $D(t, T; X_t)$. In financial
and actuarial valuation the regression function $\mu(\cdot)$ in (11.57) should reflect a
no-arbitrage price. Therefore, $\mathbb{P}$ in (11.57) should be an equivalent martingale measure
w.r.t. the selected numéraire. In our case, we choose a force of mortality $(m_{x+t})_t$-adjusted
zero-coupon bond price as numéraire. This implies that $\mathbb{P}$ is a mortality-adjusted
forward measure; for details and its explicit derivation we refer to Sect. 5.1
of Ha–Bauer [177]. In particular, Ha–Bauer [177] introduce a three-dimensional
Brownian motion based model for $(X_t)_t$ from which they deduce all relevant terms
explicitly. We skip these calculations here, because, once the GMIB term and the
discount factor are determined, everything boils down to knowing the distribution
of the random vector $(X_t, X_T)$ under the corresponding probability measure $\mathbb{P}$. We
choose initial age $x = 55$, maturity $T = 15$ and (solvency) time horizon $t = 1$.
Under the model and parametrization of Ha–Bauer [177] we obtain a multivariate
Gaussian distribution under $\mathbb{P}$ given by

$$\left(X_t^\top, X_T^\top\right)^\top = \left(q_t, r_t, m_{x+t}, q_T, r_T, m_{x+T}\right)^\top \tag{11.58}$$
$$\sim \mathcal{N}\left(
\begin{pmatrix} 4.64 \\ 0.02 \\ 0.01 \\ 4.71 \\ 0.02 \\ 0.03 \end{pmatrix},
\begin{pmatrix}
3.2 \cdot 10^{-2} & -4.8 \cdot 10^{-4} & 1.3 \cdot 10^{-5} & 3.1 \cdot 10^{-2} & -1.4 \cdot 10^{-5} & 3.6 \cdot 10^{-5} \\
-4.8 \cdot 10^{-4} & 7.9 \cdot 10^{-5} & -4.4 \cdot 10^{-7} & -1.7 \cdot 10^{-4} & 2.4 \cdot 10^{-6} & -1.2 \cdot 10^{-6} \\
1.3 \cdot 10^{-5} & -4.4 \cdot 10^{-7} & 1.5 \cdot 10^{-6} & 1.2 \cdot 10^{-5} & -1.3 \cdot 10^{-8} & 4.1 \cdot 10^{-6} \\
3.1 \cdot 10^{-2} & -1.7 \cdot 10^{-4} & 1.2 \cdot 10^{-5} & 4.5 \cdot 10^{-1} & -1.3 \cdot 10^{-3} & 3.0 \cdot 10^{-4} \\
-1.4 \cdot 10^{-5} & 2.4 \cdot 10^{-6} & -1.3 \cdot 10^{-8} & -1.3 \cdot 10^{-3} & 2.0 \cdot 10^{-4} & -2.5 \cdot 10^{-6} \\
3.6 \cdot 10^{-5} & -1.2 \cdot 10^{-6} & 4.1 \cdot 10^{-6} & 3.0 \cdot 10^{-4} & -2.5 \cdot 10^{-6} & 7.4 \cdot 10^{-5}
\end{pmatrix}
\right).$$


Under the model specification of Ha–Bauer [177], one can furthermore work out the
discount factor and the annuity. Define for $t \ge 0$ and $k \ge 0$ the affine term structure

$$F(t, k; r_t, m_{x+t}) = \exp\left\{A(t, t+k) - B(t, t+k; \alpha)\, r_t - B(t, t+k; -\kappa)\, m_{x+t}\right\},$$

with deterministic functions

$$B(t, t+k; \alpha) = \frac{1 - e^{-\alpha k}}{\alpha},$$

$$\begin{aligned}
A(t, t+k) ={}& \bar{\gamma}\left(B(t, t+k; \alpha) - k\right) + \frac{\sigma_r^2}{2\alpha^2}\left(k - 2B(t, t+k; \alpha) + B(t, t+k; 2\alpha)\right) \\
&+ \frac{\psi^2}{2\kappa^2}\left(k - 2B(t, t+k; -\kappa) + B(t, t+k; -2\kappa)\right) \\
&+ \frac{\rho_{2,3}\,\sigma_r\,\psi}{\alpha\kappa}\left(B(t, t+k; -\kappa) - k + B(t, t+k; \alpha) - B(t, t+k; \alpha - \kappa)\right),
\end{aligned}$$

with parameters for the short rate process $\alpha = 25\%$, $\sigma_r = 1\%$, for the force of
mortality $\kappa = 7\%$, $\psi = 0.12\%$, the correlation between the short rate and the force
of mortality $\rho_{2,3} = -4\%$, and with market-price of the risk-adjusted mean reversion
level $\bar{\gamma} = 1.92\%$ of the short rate process. These formulas can be retrieved because
we work under an affine Gaussian structure. The discount factor is then given by

$$D(t, T; X_t) = F(t, T - t; r_t, m_{x+t}),$$

and the annuity is determined by (we cap at age 55 + 50 = 105)

$$a_{x+T}(r_T, m_{x+T}) = \sum_{k=1}^{50} F(T, k; r_T, m_{x+T}).$$

Moreover, we set for the face value $b = 10.79205$. This parametrization implies that
the VA account value $e^{q_T}$ exceeds the GMIB $b\, a_{x+T}(r_T, m_{x+T})$ with a probability
of roughly 40%, i.e., in roughly 60% of the cases we exercise the GMIB option.
Figure 11.23 shows the marginal densities of these two variables; moreover, their
correlation is close to 0.

Fig. 11.23 Marginal densities of VA account and GMIB
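The discount factor, the annuity and the payoff can be coded directly from these definitions. The following base R sketch uses the parameter values stated above; the function names (`B`, `A`, `Fk`, `discount`, `annuity`, `payoff`) are illustrative choices, and the form of `A` is an assumption in the sense that it follows the Vasicek-type reading of $A(t, t+k)$ coded below, which should be checked against Ha–Bauer [177] before any serious use.

```r
# Parameters as stated in the text
alpha   <- 0.25      # mean reversion speed of the short rate
sigma_r <- 0.01      # volatility of the short rate
kappa   <- 0.07      # growth rate of the force of mortality
psi     <- 0.0012    # volatility of the force of mortality
rho23   <- -0.04     # correlation between short rate and force of mortality
gam     <- 0.0192    # risk-adjusted mean reversion level of the short rate
b       <- 10.79205  # face value of the GMIB

# B(t, t+k; a) depends on t only through the time lag k
B <- function(k, a) (1 - exp(-a * k)) / a

# A(t, t+k), assumed Vasicek-type form
A <- function(k) {
  gam * (B(k, alpha) - k) +
    sigma_r^2 / (2 * alpha^2) * (k - 2 * B(k, alpha) + B(k, 2 * alpha)) +
    psi^2 / (2 * kappa^2) * (k - 2 * B(k, -kappa) + B(k, -2 * kappa)) +
    rho23 * sigma_r * psi / (alpha * kappa) *
      (B(k, -kappa) - k + B(k, alpha) - B(k, alpha - kappa))
}

# Affine term structure F(t, k; r, m); vectorized in k
Fk <- function(k, r, m) exp(A(k) - B(k, alpha) * r - B(k, -kappa) * m)

# Discount factor D(t, T; X_t), annuity a_{x+T} (capped after 50 years) and payoff S
discount <- function(r_t, m_t, t = 1, Tmat = 15) Fk(Tmat - t, r_t, m_t)
annuity  <- function(r_T, m_T) sum(Fk(1:50, r_T, m_T))
payoff   <- function(q_T, r_T, m_T) max(exp(q_T), b * annuity(r_T, m_T))
```

A call such as `discount(0.02, 0.01) * payoff(4.71, 0.02, 0.03)` evaluates one discounted payoff at the mean values of (11.58); applying the same functions row by row to simulated Gaussian vectors gives the regression responses used in the remainder of the example.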

The model is now fully specified so that we can estimate the conditional expectation
in (11.57) as a function of $X_t$. We therefore simulate n = 3’000’000 i.i.d. Gaussian
observations $(X_t^{(i)}, X_T^{(i)})$, $1 \le i \le n$, from (11.58). This provides us with the
observations

$$Y_i = D(t, T; X_t^{(i)})\, S(X_T^{(i)}) = F\left(t, T - t; r_t^{(i)}, m_{x+t}^{(i)}\right) \max\left\{e^{q_T^{(i)}},\ b \sum_{k=1}^{50} F\left(T, k; r_T^{(i)}, m_{x+T}^{(i)}\right)\right\}.$$

The resulting data $(Y_i, X_t^{(i)})_{1 \le i \le n}$ is used for determining the regression function
$\mu(\cdot)$ in (11.57). We choose n = 3’000’000 samples in line with the least squares
Monte Carlo approximation of Ha–Bauer [177].

We choose a FN network of depth $d = 3$ for approximating $\mu(\cdot)$. For the three FN
layers we choose $(q_1, q_2, q_3) = (20, 15, 10)$ neurons with the hyperbolic tangent
activation function, and as output activation we choose the identity function; we
choose a more complex network compared to Cheridito et al. [74] because it seems
that this gives us more accurate results. We fit this FN network using the square loss
function. The square loss is motivated by (11.55). Furthermore, we average over 20
runs with different seeds. Thus, we obtain 20 fitted FN networks $\widehat{\mu}_k(\cdot)$ for the 20
different seeds $1 \le k \le 20$ and the nagging predictor is obtained by averaging

$$\bar{\mu}(\cdot) = \frac{1}{20} \sum_{k=1}^{20} \widehat{\mu}_k(\cdot).$$

We then generate new i.i.d. samples $X_t^{(l)}$, $1 \le l \le L$, from the multivariate Gaussian
distribution (11.58), where this time we only need the first 3 components. This gives
us the empirical samples

$$\bar{\mu}\left(X_t^{(l)}\right) \qquad \text{for } 1 \le l \le L, \tag{11.59}$$

providing an empirical distribution $\widehat{F}_{\bar{\mu}(X_t)}$ that approximates the distribution of
$\mu(X_t)$, given in (11.57). In risk management and solvency analysis, this empirical
distribution can be used to estimate the Value-at-Risk (VaR) and the (upper)
conditional tail expectation (CTE) in valuation $\mu(X_t)$, seen from time 0, on
different safety levels $p \in (0, 1)$

$$\mathrm{VaR}_p = \widehat{F}^{-1}_{\bar{\mu}(X_t)}(p) = \inf\left\{y \in \mathbb{R};\ \widehat{F}_{\bar{\mu}(X_t)}(y) \ge p\right\},$$

and

$$\mathrm{CTE}_p = \mathbb{E}_{\widehat{F}_{\bar{\mu}(X_t)}}\left[\bar{\mu}(X_t) \,\middle|\, \bar{\mu}(X_t) > \mathrm{VaR}_p\right].$$

We also refer to Sect. 11.3. The VaR and the CTE are two commonly used risk
measures in insurance practice that determine the necessary risk bearing capital to
run the corresponding insurance business. Typically, the VaR is evaluated on p =
99.5%, i.e., we allow for a default probability of 0.5% of not being able to cover
the changes in valuation over a $t = 1$ year time horizon. Alternatively, the CTE is
considered on p = 99% which means that we need sufficient capital to cover on
average the 1% worst changes in valuation over a 1 year time horizon.
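Both risk measures are simple functionals of the empirical sample. As a minimal base R sketch (the function names `empirical_var` and `empirical_cte` are illustrative, and the log-normal sample merely stands in for the simulated valuations), the VaR is the generalized inverse of the empirical distribution function and the CTE averages the observations strictly above it:

```r
# Empirical VaR: inf{ y : Fhat(y) >= p } for the empirical distribution Fhat
empirical_var <- function(sample, p) {
  xs <- sort(sample)
  Fhat <- seq_along(xs) / length(xs)   # empirical distribution function at xs
  xs[which(Fhat >= p)[1]]
}

# Empirical CTE: average of the observations strictly beyond the VaR
empirical_cte <- function(sample, p) {
  v <- empirical_var(sample, p)
  mean(sample[sample > v])
}

set.seed(300)
# Stand-in for the sample of valuations (an assumption, not the output of the VA model)
val <- exp(rnorm(100000, mean = 4.7, sd = 0.2))
VaR_995 <- empirical_var(val, 0.995)
CTE_995 <- empirical_cte(val, 0.995)
CTE_99  <- empirical_cte(val, 0.99)
```

For a sample from a continuous distribution, the CTE at a given level always exceeds the VaR at the same level, whereas a CTE at a lower level can be close to a VaR at a higher level.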

Figure 11.24 shows our FN network approximations. The boxplots show the
individual results of the estimates $\widehat{\mu}_k(\cdot)$ with 20 different seeds, and the horizontal
lines show the results of the nagging predictor (11.59). The red line at 140.97
gives the estimated VaR for p = 99.5%; this value is slightly bigger than the best
estimate of 139.47 (orange line) in Ha–Bauer [177] which is based on a functional
approximation involving 37 monomials and 40’000’000 simulated samples. CTEs
on p = 99.5% and p = 99% are given by 145.09 and 141.49. We conclude that in
the present example $\mathrm{VaR}_{99.5\%}$ (used in Europe) and $\mathrm{CTE}_{99\%}$ (used in Switzerland)
are approximately of the same size for this VA with a GMIB.

Fig. 11.24 Resulting $\mathrm{VaR}_{99.5\%}$ (red), $\mathrm{CTE}_{99.5\%}$ (green) and $\mathrm{CTE}_{99\%}$ (blue); the orange line gives the
result of Ha–Bauer [177] for the 99.5% VaR


This example shows how problems can be solved that require the computation
of a conditional expectation. Alternatively, we could explore the LocalGLMnet
architecture, which would allow us to explain the conditional expectation more
explicitly in terms of the information $X_t$ available at time $t$. This may also be
relevant in practice because it allows us to determine the main risk drivers of the
underlying insurance business.

Figure 11.25 shows the marginal densities of the components of $X_t =
(q_t, r_t, m_{x+t})$ in blue color. In red color we show the corresponding conditional
densities of $X_t$, conditioned on $\bar{\mu}(X_t) > \mathrm{VaR}_{99.5\%}$; thus, these are the feature
values $X_t$ that lead to a shortfall beyond the 99.5% VaR of $\bar{\mu}(X_t)$. From this
figure we conclude that the main driver of VaR is the VA account variable $q_t$,
whereas the short rate $r_t$ and the force of mortality $m_{x+t}$ are slightly lower beyond
the VaR compared to their unconditioned counterparts. The explanation for these
smaller values is that they lead to less discounting and, hence, to bigger GMIB
values. This is useful information for exploring importance sampling as mentioned
above.

Fig. 11.25 Feature values $X_t$ triggering VaR on the 99.5% level: (lhs) VA account log-value $q_t$,
(middle) short rate $r_t$, and (rhs) force of mortality $m_{x+t}$; blue color shows the full density and red
color shows the conditional density conditioned on being above the 99.5% VaR of $\bar{\mu}(X_t)$


11.6.3 Bayesian Networks: An Outlook


This section provides a short introduction to Bayesian networks and to variational
inference. We see this section as a motivation for doing more research in that
direction. In Sect. 11.4 we have assessed model uncertainty through bootstrapping.
Alternatively, we could take a Bayesian viewpoint. We start from a fixed network
architecture that involves a network parameter $\vartheta$. The Bayesian approach considered
in Section 6.1 selects a prior density $\pi(\vartheta)$ on the space of network parameters
(w.r.t. a measure $\nu$). For given data $(Y, x)$ we can then calculate the posterior density
of $\vartheta$ by

$$\pi(\vartheta | Y, x) \propto f(Y, \vartheta | x) = f(Y | \vartheta, x)\, \pi(\vartheta). \tag{11.60}$$

A new data point $Y^\dagger$ with feature $x^\dagger$ has conditional density, given observation
$(Y, x)$,

$$f(y^\dagger | x^\dagger; Y, x) = \int_{\vartheta} f(y^\dagger | \vartheta, x^\dagger)\, \pi(\vartheta | Y, x)\, d\nu(\vartheta),$$

provided that $(Y, x)$ and $(Y^\dagger, x^\dagger)$ are conditionally independent, given $\vartheta$. Thus,
there only remains to determine the posterior density (11.60) of the network
parameter $\vartheta$. Unfortunately, this is a rather challenging problem because of the
curse of dimensionality, and even advanced MCMC methods, such as HMC, often
do not lead to satisfactory results (convergence); for MCMC we refer to Section 6.1.
For this reason one often explores approximate inference methods, see, e.g.,
Chapter 10 of Bishop [36] or the tutorial of Jospin et al. [205]. A scalable version
is to approximate the posterior density using the so-called method of variational
inference. This is presented in the following.

Choose a family $\mathcal{F} = \{q(\cdot; \theta);\ \theta \in \Theta\}$ of (more tractable) densities that have
the same support as the prior $\pi(\cdot)$, and that are parametrized by $\theta$ in a Euclidean
parameter space $\Theta$. This family $\mathcal{F}$ is called the set of variational distributions, and the goal is to find the
variational density $q(\cdot; \theta) \in \mathcal{F}$ that is closest to the posterior density (11.60).

To evaluate the similarity between two densities, we use the KL divergence which
analyzes the divergence from $\pi(\cdot | Y, x)$ to $q(\cdot; \theta)$ given by

$$D_{\mathrm{KL}}\left(q(\cdot; \theta) \,\middle\|\, \pi(\cdot | Y, x)\right) = \int_{\vartheta} q(\vartheta; \theta) \log\left(\frac{q(\vartheta; \theta)}{\pi(\vartheta | Y, x)}\right) d\nu(\vartheta).$$

The optimal approximation within $\mathcal{F}$, for given data $(Y, x)$, is found by solving

$$\widehat{\theta} = \widehat{\theta}(Y, x) = \underset{\theta \in \Theta}{\arg\min}\ D_{\mathrm{KL}}\left(q(\cdot; \theta) \,\middle\|\, \pi(\cdot | Y, x)\right);$$

for the moment we neglect existence and uniqueness questions. A main difficulty is
the computation of this KL divergence because it involves the intractable posterior
density of $\vartheta$, given $(Y, x)$. We modify the optimization problem such that we can
circumvent the explicit calculation of this KL divergence.


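Lemma 11.19 We have the following identity

$$\log f(Y | x) = \mathcal{E}(\theta | Y, x) + D_{\mathrm{KL}}\left(q(\cdot; \theta) \,\middle\|\, \pi(\cdot | Y, x)\right),$$

for the (unconditional) density $f(y | x) = \int_{\vartheta} f(y | \vartheta, x)\, \pi(\vartheta)\, d\nu(\vartheta)$ and the so-called
evidence lower bound (ELBO)

$$\mathcal{E}(\theta | Y, x) = \int_{\vartheta} q(\vartheta; \theta) \log\left(\frac{f(Y, \vartheta | x)}{q(\vartheta; \theta)}\right) d\nu(\vartheta).$$

Observe that the left-hand side in the statement of Lemma 11.19 is independent of
$\theta \in \Theta$. Therefore, minimizing the KL divergence in $\theta$ is equivalent to maximizing
the ELBO in $\theta$. This follows exactly the same philosophy as the EM algorithm,
see (6.32); in fact, the ELBO $\mathcal{E}$ plays the role of the functional $Q$ defined in (6.33).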
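Proof of Lemma 11.19 We start from the left-hand side of the statement

$$\begin{aligned}
\log f(Y | x) &= \int_{\vartheta} q(\vartheta; \theta) \log f(Y | x)\, d\nu(\vartheta) = \int_{\vartheta} q(\vartheta; \theta) \log\left(\frac{f(Y, \vartheta | x)}{\pi(\vartheta | Y, x)}\right) d\nu(\vartheta) \\
&= \int_{\vartheta} q(\vartheta; \theta) \log\left(\frac{f(Y, \vartheta | x) / q(\vartheta; \theta)}{\pi(\vartheta | Y, x) / q(\vartheta; \theta)}\right) d\nu(\vartheta) \\
&= \mathcal{E}(\theta | Y, x) + D_{\mathrm{KL}}\left(q(\cdot; \theta) \,\middle\|\, \pi(\cdot | Y, x)\right).
\end{aligned}$$

This proves the claim. □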
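The identity of Lemma 11.19 can be verified numerically on a toy example in which the parameter takes only three values, so that $\nu$ is the counting measure and all integrals become finite sums. The numbers below are arbitrary illustrative choices, not taken from the text.

```r
# Discrete toy example: the parameter takes 3 values (counting measure nu)
prior <- c(0.5, 0.3, 0.2)      # prior density pi(vartheta)
lik   <- c(0.10, 0.40, 0.25)   # likelihood f(Y | vartheta, x) of the observed Y
joint <- lik * prior           # joint density f(Y, vartheta | x)
evidence  <- sum(joint)        # unconditional density f(Y | x)
posterior <- joint / evidence  # posterior density pi(vartheta | Y, x)

q <- c(0.2, 0.5, 0.3)          # an arbitrary variational density q(.; theta)

elbo <- sum(q * log(joint / q))        # ELBO
kl   <- sum(q * log(q / posterior))    # KL divergence from the posterior to q

# Lemma 11.19: log f(Y | x) = ELBO + KL divergence
gap <- log(evidence) - (elbo + kl)
```

Since the KL divergence is non-negative, `elbo` can never exceed `log(evidence)`, and the two coincide exactly when `q` equals `posterior`.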


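The ELBO provides the lower bound (also called variational lower bound)

$$\log f(Y | x) \ge \sup_{\theta \in \Theta} \mathcal{E}(\theta | Y, x).$$

Interestingly, the ELBO does not include the posterior density, but only the joint
density of $Y$ and $\vartheta$, given $x$, which is assumed to be known (available). It can be
rewritten as

$$\begin{aligned}
\mathcal{E}(\theta | Y, x) &= \int_{\vartheta} q(\vartheta; \theta) \log f(Y, \vartheta | x)\, d\nu(\vartheta) - \int_{\vartheta} q(\vartheta; \theta) \log q(\vartheta; \theta)\, d\nu(\vartheta) \\
&= \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log f(Y, \vartheta | x) \,\middle|\, Y, x\right] - \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log q(\vartheta; \theta)\right],
\end{aligned}$$

the first term being the expected joint log-likelihood of $(Y, \vartheta)$ under the variational
density $\vartheta \sim q(\cdot; \theta)$, and the second term being the entropy of the variational density.
The optimal approximation within $\mathcal{F}$ for given data $(Y, x)$ is then found by
solving

$$\widehat{\theta} = \widehat{\theta}(Y, x) = \underset{\theta \in \Theta}{\arg\max}\ \mathcal{E}(\theta | Y, x).$$

That is, we try to simultaneously maximize the expected joint log-likelihood of
$(Y, \vartheta)$ and the entropy over all variational densities $q(\cdot; \theta)$ in $\mathcal{F}$.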

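If we have multiple observations $\mathcal{D} = \{(Y_i, x_i);\ 1 \le i \le n\}$ that are
conditionally i.i.d., given $\vartheta$, we have to solve (we use conditional independence)

$$\begin{aligned}
\widehat{\theta} &= \underset{\theta \in \Theta}{\arg\max}\ \mathcal{E}(\theta | \mathcal{D}) \\
&= \underset{\theta \in \Theta}{\arg\max}\ \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log\left(\pi(\vartheta) \prod_{i=1}^{n} f(Y_i | \vartheta, x_i)\right) \,\middle|\, \mathcal{D}\right] - \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log q(\vartheta; \theta)\right] \\
&= \underset{\theta \in \Theta}{\arg\max} \left(\sum_{i=1}^{n} \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log f(Y_i | \vartheta, x_i) \,\middle|\, Y_i, x_i\right]\right) - \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log\left(\frac{q(\vartheta; \theta)}{\pi(\vartheta)}\right)\right] \\
&= \underset{\theta \in \Theta}{\arg\max} \left(\sum_{i=1}^{n} \mathbb{E}_{\vartheta \sim q(\cdot; \theta)}\left[\log f(Y_i | \vartheta, x_i) \,\middle|\, Y_i, x_i\right]\right) - D_{\mathrm{KL}}\left(q(\cdot; \theta) \,\middle\|\, \pi\right).
\end{aligned}$$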


Typically, one solves this problem with gradient ascent methods, which requires
calculation of the gradient $\nabla_\theta$ of the objective function on the right-hand side. This
is more difficult than plain vanilla gradient descent in network fitting because $\theta$
enters the expectation operator $\mathbb{E}_{\vartheta \sim q(\cdot; \theta)}$.

Kingma—Welling [217] propose to use the following reparametrization trick. 


Assume that we can receive the random variable 3 ~ q(-; 0) by a reparametrization 


v @ t(€, 0) for some smooth function t and where € ~ p does not depend on @. 


E.g., if # is multivariate Gaussian with mean yw and covariance matrix AA! , then 


d . PA ; ; 
v 2 H + Ae for e being standard multivariate Gaussian. Under the assumption that 


the reparametrization trick works for the family F = {q(-; 0); 0 € ©} we arrive at, 
fore ~ p, 


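$$\begin{aligned}
\widehat{\theta} &= \underset{\theta\in\Theta}{\arg\max}\ \mathcal{E}(\theta|\mathcal{D}) \\
&= \underset{\theta\in\Theta}{\arg\max}\ \sum_{i=1}^n \Big(\mathbb{E}_p\big[\log f(Y_i|t(\varepsilon,\theta),x_i)\,\big|\,Y_i,x_i\big] - \frac{1}{n}\,\mathbb{E}_p\Big[\log\Big(\frac{q(t(\varepsilon,\theta);\theta)}{\pi(t(\varepsilon,\theta))}\Big)\Big]\Big) \\
&= \underset{\theta\in\Theta}{\arg\max}\ \sum_{i=1}^n \mathbb{E}_p\Big[\log\Big(\frac{f(Y_i|t(\varepsilon,\theta),x_i)\,\pi(t(\varepsilon,\theta))^{1/n}}{q(t(\varepsilon,\theta);\theta)^{1/n}}\Big)\,\Big|\,Y_i,x_i\Big].
\end{aligned} \tag{11.61}$$

The gradient of the ELBO is then given by (supposing we can exchange $\mathbb{E}_p$ and $\nabla_\theta$)

$$\nabla_\theta\,\mathcal{E}(\theta|\mathcal{D}) = \sum_{i=1}^n \mathbb{E}_p\Big[\nabla_\theta \log\Big(\frac{f(Y_i|t(\varepsilon,\theta),x_i)\,\pi(t(\varepsilon,\theta))^{1/n}}{q(t(\varepsilon,\theta);\theta)^{1/n}}\Big)\,\Big|\,Y_i,x_i\Big].$$
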
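These expected gradients are calculated empirically using Monte Carlo methods.
Sample i.i.d. observations $\varepsilon^{(i,j)} \sim p$, $1\le i\le n$ and $1\le j\le m$, and consider the
empirical approximation

$$\nabla_\theta\,\mathcal{E}(\theta|\mathcal{D}) \approx \frac{1}{m}\sum_{i=1}^n\sum_{j=1}^m \nabla_\theta \log\Big(\frac{f(Y_i|t(\varepsilon^{(i,j)},\theta),x_i)\,\pi(t(\varepsilon^{(i,j)},\theta))^{1/n}}{q(t(\varepsilon^{(i,j)},\theta);\theta)^{1/n}}\Big). \tag{11.62}$$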


Using this empirical approximation we can use gradient ascent methods to estimate
$\theta$; this is known as the stochastic gradient variational Bayes (SGVB) estimator, see
Sect. 2.4.3 of Kingma–Welling [217], or as Bayes by Backprop, see Blundell et al. [41] and
Jospin et al. [205].
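
The mechanism behind the Monte Carlo gradient can be illustrated on a toy problem. The following minimal sketch is not the ELBO itself: it assumes a one-dimensional Gaussian variational density with $\theta = (\mu, \sigma)$ and replaces the log-ratio in the integrand by the hypothetical test function $h(\vartheta) = \vartheta^2$, for which the exact gradient of $\mathbb{E}[h(\vartheta)] = \mu^2 + \sigma^2$ is $(2\mu, 2\sigma)$. The function name `reparam_gradient` and all numerical values are illustrative assumptions.

```python
import random


def reparam_gradient(mu, sigma, m=100_000, seed=0):
    """Monte Carlo gradient of E[h(vartheta)] w.r.t. theta = (mu, sigma).

    Toy integrand h(v) = v**2 (an assumption for illustration only).
    Reparametrization: vartheta = t(eps, theta) = mu + sigma * eps, eps ~ N(0, 1).
    """
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(m):
        eps = rng.gauss(0.0, 1.0)   # eps ~ p does not depend on theta
        v = mu + sigma * eps        # t(eps, theta)
        dh = 2.0 * v                # outer derivative h'(v)
        g_mu += dh * 1.0            # inner derivative dt/dmu = 1
        g_sigma += dh * eps         # inner derivative dt/dsigma = eps
    return g_mu / m, g_sigma / m


grad = reparam_gradient(mu=1.0, sigma=0.5)
exact = (2.0 * 1.0, 2.0 * 0.5)      # gradient of mu**2 + sigma**2
print(grad, exact)
```

Because the sampling distribution of `eps` is free of $\theta$, the gradient can be taken inside the expectation, which is exactly the step that is not available when sampling $\vartheta$ directly from $q(\cdot;\theta)$.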


Example 11.20 We consider the gradient (11.62) for an example from the EDF.
First, if $n$ is sufficiently large, it often suffices to set $m = 1$, and we still receive
an accurate estimate. In that case we drop index $j$ giving $\varepsilon^{(i)}$. Assume that the
(conditionally independent) observations $Y_i$ belong to the same member of the EDF
having cumulant function $\kappa$. Moreover, assume that the (conditional) mean of $Y_i$,
given $x_i$, can be described by a FN network and a link function $g$ such that, see (7.8),

$$\mu_i = \mu_i(\vartheta) = \mu_\vartheta(x_i) = g^{-1}\big\langle \beta, z^{(d:1)}(x_i)\big\rangle,$$


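for network parameter $\vartheta = (\beta, w) \in \mathbb{R}^r$. In a Bayesian FN network this network
parameter is not fixed but rather acts as a latent variable. In (11.62) this latent
variable is for realization $i$ given by (and using the reparametrization trick)
$\vartheta = t(\varepsilon^{(i)},\theta) \in \mathbb{R}^r$; $\theta$ is not the canonical parameter, here. Thus, we receive
the conditional mean of $Y_i$, given $\varepsilon^{(i)}$ and $x_i$,

$$\mu_i = \mu_{\vartheta(\varepsilon^{(i)};\theta)}(x_i) = g^{-1}\big\langle \beta(\varepsilon^{(i)};\theta),\, z^{(d:1)}_{w(\varepsilon^{(i)};\theta)}(x_i)\big\rangle,$$

with network parameter $\vartheta(\varepsilon^{(i)};\theta) = (\beta(\varepsilon^{(i)};\theta), w(\varepsilon^{(i)};\theta)) = t(\varepsilon^{(i)},\theta) \in \mathbb{R}^r$.
Maximizing the ELBO implies that we need to calculate the gradients w.r.t. $\theta$. First,
we calculate the gradient w.r.t. the network parameter $\vartheta$ of the data log-likelihood

$$\nabla_\vartheta \log f(Y_i|\vartheta,x_i) = \nabla_\vartheta\, \ell_{Y_i}(\vartheta) \in \mathbb{R}^r.$$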


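This gradient is calculated with back-propagation, we refer to (7.16) and Proposition 7.5.
There remains the chain rule for evaluating the inner derivative coming
from the reparametrization trick $\theta \in \Theta \subset \mathbb{R}^K \mapsto \vartheta = t(\varepsilon^{(i)};\theta) \in \mathbb{R}^r$. Consider
the Jacobian matrix

$$J(\theta;\varepsilon^{(i)}) = \Big(\frac{\partial}{\partial\theta_k}\, t_j(\varepsilon^{(i)};\theta)\Big)_{1\le j\le r,\ 1\le k\le K} \in \mathbb{R}^{r\times K}.$$

This gives us the gradient w.r.t. $\theta$

$$\nabla_\theta \log f\big(Y_i\big|t(\varepsilon^{(i)},\theta),x_i\big) = J(\theta;\varepsilon^{(i)})^\top \Big(\nabla_\vartheta \log f(Y_i|\vartheta,x_i)\Big|_{\vartheta = t(\varepsilon^{(i)},\theta)}\Big) \in \mathbb{R}^K. \tag{11.63}$$
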
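The prior distribution is often taken to be the multivariate Gaussian with prior mean
$\tau \in \mathbb{R}^r$ and (symmetric and positive definite) prior covariance matrix $T \in \mathbb{R}^{r\times r}$,
thus,

$$\pi(\vartheta) = (2\pi)^{-r/2}\,|T|^{-1/2} \exp\Big\{-\frac{1}{2}(\vartheta-\tau)^\top T^{-1}(\vartheta-\tau)\Big\}.$$

This implies for the gradient w.r.t. $\theta$ for the prior

$$\nabla_\theta \log \pi\big(t(\varepsilon^{(i)},\theta)\big) = -J(\theta;\varepsilon^{(i)})^\top\, T^{-1}\big(t(\varepsilon^{(i)},\theta)-\tau\big) \in \mathbb{R}^K.$$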


There remains the choice of the family $\mathcal{F} = \{q(\cdot;\theta);\ \theta\in\Theta\}$ of variational densities
such that the reparametrization trick works. This is discussed in the remainder.


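We briefly discuss the most popular and simplest family chosen for the variational
distributions $\mathcal{F}$. This family is the so-called mean field Gaussian variational
family, meaning that all components of $\vartheta \in \mathbb{R}^r$ are assumed to be independent
Gaussian, that is,

$$q(\vartheta;\theta) = \prod_{j=1}^r \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\Big\{-\frac{(\vartheta_j-\mu_j)^2}{2\sigma_j^2}\Big\},$$

for $\theta = (\mu_1,\sigma_1,\ldots,\mu_r,\sigma_r)^\top \in \mathbb{R}^K$ with $K = 2r$ and with $\sigma_j > 0$ for all $1\le j\le r$.
This allows us to apply the reparametrization trick

$$\vartheta \overset{(d)}{=} t(\varepsilon,\theta) = \mu + \mathrm{diag}(\sigma_1,\ldots,\sigma_r)\,\varepsilon = \begin{pmatrix}\mu_1+\sigma_1\varepsilon_1\\ \vdots\\ \mu_r+\sigma_r\varepsilon_r\end{pmatrix},$$

with $r$-dimensional standard Gaussian variable $\varepsilon \sim \mathcal{N}(0, \mathbb{1})$. The Jacobian matrix
is

$$J(\theta;\varepsilon) = \begin{pmatrix}
1 & \varepsilon_1 & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & 1 & \varepsilon_2 & \cdots & 0 & 0\\
\vdots & & & & \ddots & & \vdots\\
0 & 0 & 0 & 0 & \cdots & 1 & \varepsilon_r
\end{pmatrix} \in \mathbb{R}^{r\times K}.$$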
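
The block structure of this Jacobian makes the chain rule in the mean field Gaussian case very cheap. The following sketch builds $J(\theta;\varepsilon)$ for the ordering $\theta = (\mu_1,\sigma_1,\ldots,\mu_r,\sigma_r)^\top$ and pulls a gradient w.r.t. the network parameter back to a gradient w.r.t. $\theta$. The function names `jacobian` and `pull_back` and the numerical inputs are hypothetical, and the vector `g` merely stands in for a back-propagated gradient.

```python
def jacobian(eps):
    """J(theta; eps) in R^{r x 2r} for t_j(eps, theta) = mu_j + sigma_j * eps_j.

    Columns are ordered as theta = (mu_1, sigma_1, ..., mu_r, sigma_r).
    """
    r = len(eps)
    J = [[0.0] * (2 * r) for _ in range(r)]
    for j, e in enumerate(eps):
        J[j][2 * j] = 1.0       # d t_j / d mu_j
        J[j][2 * j + 1] = e     # d t_j / d sigma_j
    return J


def pull_back(eps, grad_vartheta):
    """Chain rule J^T g: gradient w.r.t. theta from a gradient g w.r.t. vartheta."""
    J = jacobian(eps)
    r = len(eps)
    return [sum(J[j][k] * grad_vartheta[j] for j in range(r))
            for k in range(2 * r)]


eps = [0.5, -1.0]            # one standard Gaussian draw (fixed here for illustration)
g = [2.0, 3.0]               # placeholder for a back-propagated gradient in R^r
grad_theta = pull_back(eps, g)
print(grad_theta)            # [2.0, 1.0, 3.0, -3.0]
```

In practice one would not materialize $J$ at all: the result is simply $(g_j, g_j\varepsilon_j)$ for each component $j$, which the example output makes visible.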


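The mean field Gaussian case provides the entropy of the variational distribution

$$-\mathbb{E}_{q(\cdot;\theta)}\big[\log q(\vartheta;\theta)\big] = \sum_{j=1}^r \frac{1}{2}\log\big(2\pi\sigma_j^2\big) + \frac{r}{2} = \sum_{j=1}^r \log\big(\sqrt{2\pi e}\,\sigma_j\big).$$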


This mean field Gaussian variational inference can be implemented with the R 
package tfprobability of Keydana et al. [212] and an explicit example is 
given in Kuo [230]. 


Example 11.20, Revisited Working under the assumptions of Example 11.20 and
additionally assuming that the family of variational distributions $\mathcal{F}$ is multivariate
Gaussian $q(\cdot;\theta) \overset{(d)}{=} \mathcal{N}(\mu,\Sigma)$ leads us after some calculation to (the well-known
formula)

$$D_{\mathrm{KL}}\big(q(\cdot;\theta)\,\big\|\,\pi\big) = \frac{1}{2}\Big[\log\Big(\frac{|T|}{|\Sigma|}\Big) - r + \mathrm{trace}\big(T^{-1}\Sigma\big) + (\tau-\mu)^\top T^{-1}(\tau-\mu)\Big].$$

This further simplifies if $T$ and $\Sigma$ are diagonal, the latter being the mean field
Gaussian case. The remaining terms of the ELBO are treated empirically as
in (11.63). ■
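
For diagonal $T$ and $\Sigma$ the KL divergence decouples into a sum over the $r$ components. A minimal sketch of this special case follows, assuming that both covariance matrices are passed as lists of standard deviations; the function name `kl_diag_gaussians` and the test values are illustrative assumptions.

```python
import math


def kl_diag_gaussians(mu, sigma, tau, t_sd):
    """KL( N(mu, diag(sigma^2)) || N(tau, diag(t_sd^2)) ) for diagonal covariances.

    mu, sigma: variational means and standard deviations (mean field Gaussian).
    tau, t_sd: prior means and standard deviations.
    """
    r = len(mu)
    log_det_ratio = sum(2.0 * math.log(t / s) for s, t in zip(sigma, t_sd))
    trace_term = sum((s / t) ** 2 for s, t in zip(sigma, t_sd))
    quad_term = sum(((a - m) / t) ** 2 for m, a, t in zip(mu, tau, t_sd))
    return 0.5 * (log_det_ratio - r + trace_term + quad_term)


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))

# One-dimensional check: KL( N(0, 1) || N(1, 4) ) = (log 4 - 1/2) / 2.
print(kl_diag_gaussians([0.0], [1.0], [1.0], [2.0]))
```

Since this term is available in closed form, only the data log-likelihood part of the ELBO needs to be treated by Monte Carlo sampling.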


This section has provided a short introduction to uncertainty estimation in 
networks using Bayesian methods. We believe that this gives a promising outlook 
that certainly needs more theoretical and practical work to become useful in 
practical applications.


Chapter 12
Appendix A: Technical Results on Networks


The reader may have noticed that for GLMs we have developed an asymptotic
theory that allowed us to assess the quality of predictors and to validate the
fitted models. For networks such a theory does not exist yet, and
the purpose of this appendix is to present more technical results on the asymptotic
behavior of FN networks and their estimators that may lead to an asymptotic
theory. This appendix hopefully stimulates further research in this field of statistical
modeling.


12.1 Universality Theorems 


We present a specific version of the universality theorems for shallow FN networks;
we refer to the discussion in Sect. 7.2.2. This section follows Hornik et al. [192].
Choose an input dimension $q_0 \in \mathbb{N}$ and consider the set of all affine functions

$$\mathcal{A}^{q_0} = \Big\{A: \{1\}\times\mathbb{R}^{q_0}\to\mathbb{R};\ x\mapsto A(x) = \langle w,x\rangle,\ w\in\mathbb{R}^{q_0+1}\Big\},$$


we add a 0th component in feature $x = (x_0 = 1, x_1,\ldots,x_{q_0})^\top \in \{1\}\times\mathbb{R}^{q_0}$ for the
intercept. Choose a measurable (activation) function $\phi: \mathbb{R}\to\mathbb{R}$ and define

$$\Sigma^{q_0}(\phi) = \Big\{f: \{1\}\times\mathbb{R}^{q_0}\to\mathbb{R};\ x\mapsto f(x) = \sum_{j=0}^{q_1}\beta_j\,\phi(A_j(x)),\ A_j\in\mathcal{A}^{q_0},\ \beta_j\in\mathbb{R},\ q_1\in\mathbb{N}\Big\}.$$

This is the set of all shallow FN networks $f(x) = \langle\beta, z^{(1:1)}(x)\rangle$ with activation
function $\phi$ and the linear output activation, see (7.8); the intercept component of
the output is integrated into the 0th component $j = 0$. Moreover, we define the
networks

$$\Sigma\Pi^{q_0}(\phi) = \Big\{f: \{1\}\times\mathbb{R}^{q_0}\to\mathbb{R};\ x\mapsto f(x) = \sum_{j=0}^{q_1}\beta_j\prod_{k=1}^{l_j}\phi(A_{j,k}(x)),\ A_{j,k}\in\mathcal{A}^{q_0},\ \beta_j\in\mathbb{R},\ l_j\in\mathbb{N},\ q_1\in\mathbb{N}\Big\}.$$

The latter networks contain the former, $\Sigma^{q_0}(\phi)\subset\Sigma\Pi^{q_0}(\phi)$, by setting $l_j = 1$ for
all $0\le j\le q_1$. We are going to prove a universality theorem first for the networks
$\Sigma\Pi^{q_0}(\phi)$, and afterwards for the shallow FN networks $\Sigma^{q_0}(\phi)$.


Definition 12.1 The function $\phi: \mathbb{R}\to[0,1]$ is called a squashing function if it is
non-decreasing with $\lim_{x\to-\infty}\phi(x) = 0$ and $\lim_{x\to\infty}\phi(x) = 1$.


Since squashing functions can have at most countably many discontinuities, 
they are measurable; a continuous and a non-continuous example are given by the 
sigmoid and by the step function activation, respectively, see Table 7.1. 


Lemma 12.2 The sigmoid activation function is Lipschitz with constant 1/4. 


Proof The derivative of the sigmoid function is given by $\phi' = \phi(1-\phi)$. This
provides for the second derivative $\phi'' = \phi' - 2\phi\phi' = \phi'(1-2\phi)$. The latter is zero
for $\phi(x) = 1/2$. This says that the maximal slope of $\phi$ is attained for $x = 0$ and it
is $\phi'(0) = 1/4$. □
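
The Lipschitz constant can also be checked numerically. The sketch below evaluates the derivative $\phi(1-\phi)$ and the difference quotients of the sigmoid function on a grid; the grid range $[-10, 10]$ and its step size are arbitrary choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # phi' = phi * (1 - phi)


grid = [k / 100.0 for k in range(-1000, 1001)]   # grid on [-10, 10], contains x = 0

# Maximal slope on the grid; it is attained at x = 0.
max_slope = max(sigmoid_prime(x) for x in grid)

# Difference quotients between neighboring grid points never exceed 1/4.
worst_quotient = max(abs(sigmoid(b) - sigmoid(a)) / (b - a)
                     for a, b in zip(grid, grid[1:]))

print(max_slope, worst_quotient)
```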


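We denote by $\mathcal{C}(\mathbb{R}^{q_0})$ the set of all continuous functions from $\{1\}\times\mathbb{R}^{q_0}$ to
$\mathbb{R}$, and by $\mathcal{M}(\mathbb{R}^{q_0})$ the set of all measurable functions from $\{1\}\times\mathbb{R}^{q_0}$ to $\mathbb{R}$. If
the measurable activation function $\phi$ is continuous, we have $\Sigma\Pi^{q_0}(\phi)\subset\mathcal{C}(\mathbb{R}^{q_0})$,
otherwise $\Sigma\Pi^{q_0}(\phi)\subset\mathcal{M}(\mathbb{R}^{q_0})$.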


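Definition 12.3 A subset $\mathcal{S}\subset\mathcal{M}(\mathbb{R}^{q_0})$ is said to be uniformly dense on compacta
in $\mathcal{C}(\mathbb{R}^{q_0})$ if for every compact subset $K\subset\{1\}\times\mathbb{R}^{q_0}$ the set $\mathcal{S}$ is $\rho_K$-dense in $\mathcal{C}(\mathbb{R}^{q_0})$,
meaning that for all $\epsilon > 0$ and all $g\in\mathcal{C}(\mathbb{R}^{q_0})$ there exists $f\in\mathcal{S}$ such that

$$\rho_K(g,f) = \sup_{x\in K}|g(x)-f(x)| < \epsilon.$$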


Theorem 12.4 (Theorem 2.1 in Hornik et al. [192]) Assume $\phi$ is a non-constant
and continuous activation function. $\Sigma\Pi^{q_0}(\phi)\subset\mathcal{C}(\mathbb{R}^{q_0})$ is uniformly dense on
compacta in $\mathcal{C}(\mathbb{R}^{q_0})$.


Proof The proof is based on the Stone–Weierstrass theorem. We briefly recall the
Stone–Weierstrass theorem. Assume $\mathcal{A}$ is a family of real functions defined on a set
$E$. $\mathcal{A}$ is called an algebra if it is closed under addition, multiplication and scalar
multiplication. A family $\mathcal{A}$ separates points in $E$, if for every $x,z\in E$ with $x\neq z$
there exists a function $A\in\mathcal{A}$ with $A(x)\neq A(z)$. The family $\mathcal{A}$ does not vanish at
any point of $E$ if for all $x\in E$ there exists a function $A\in\mathcal{A}$ such that $A(x)\neq 0$.

Let $\mathcal{A}$ be an algebra of continuous real functions on a compact set $K$. The
Stone–Weierstrass theorem says that if $\mathcal{A}$ separates points in $K$ and if it does not vanish at
any point of $K$, then $\mathcal{A}$ is $\rho_K$-dense in the space of all continuous real functions on
$K$.

Choose any compact set $K\subset\{1\}\times\mathbb{R}^{q_0}$. For any activation function $\phi$, $\Sigma\Pi^{q_0}(\phi)$
is obviously an algebra. So there remains to prove that this algebra separates points
and does not vanish at any point. Firstly, choose $x,z\in K$ such that $x\neq z$. Since
$\phi$ is non-constant we can choose $a,b\in\mathbb{R}$ such that $\phi(a)\neq\phi(b)$. Next choose
$A\in\mathcal{A}^{q_0}$ such that $A(x) = a$ and $A(z) = b$. Then, $\phi(A(x))\neq\phi(A(z))$ and
$\Sigma\Pi^{q_0}(\phi)$ separates points. Secondly, since $\phi$ is non-constant, we can choose $a\in\mathbb{R}$
such that $\phi(a)\neq 0$. Moreover, choose weight $w = (a,0,\ldots,0)^\top\in\mathbb{R}^{q_0+1}$. Then
for this $A\in\mathcal{A}^{q_0}$, $A(x) = \langle w,x\rangle = a$ for any $x\in K$. Hence, $\phi(A(x))\neq 0$,
therefore $\Sigma\Pi^{q_0}(\phi)$ does not vanish at any point of $K$. The claim then follows from
the Stone–Weierstrass theorem and using that $\phi$ is continuous by assumption. □


For Theorem 12.4 to hold, the activation function $\phi$ can be any continuous and
non-constant function, i.e., it does not need to be a squashing function. This is
fairly general, but it rules out the step function activation as it is not continuous.
However, for squashing functions continuity is not needed and one still receives
the uniformly dense on compacta property of $\Sigma\Pi^{q_0}(\phi)$ in $\mathcal{C}(\mathbb{R}^{q_0})$; this has been
proved in Theorem 2.3 of Hornik et al. [192]. The following theorem also does not
need continuity, i.e., we do not require $\Sigma^{q_0}(\phi)\subset\mathcal{C}(\mathbb{R}^{q_0})$ as $\phi$ only needs to be
measurable (and squashing).


Theorem 12.5 (Universality, Theorem 2.4 in Hornik et al. [192]) Assume $\phi$ is a
squashing activation function. $\Sigma^{q_0}(\phi)$ is uniformly dense on compacta in $\mathcal{C}(\mathbb{R}^{q_0})$.


Sketch of Proof For the (continuous) cosine activation function choice $\cos(\cdot)$,
Theorem 12.4 applies to $\Sigma\Pi^{q_0}(\cos)$. Repeatedly applying the trigonometric identity
$\cos(a)\cos(b) = \frac{1}{2}\big(\cos(a+b) + \cos(a-b)\big)$ allows us to rewrite any trigonometric
polynomial $\prod_{k=1}^{l_j}\cos(A_{j,k}(x))$ as $\sum_{t=1}^T\alpha_t\cos(A_t(x))$ for suitable $A_t\in\mathcal{A}^{q_0}$,
$\alpha_t\in\mathbb{R}$ and $T\in\mathbb{N}$. This allows us to identify $\Sigma^{q_0}(\cos) = \Sigma\Pi^{q_0}(\cos)$. As a
consequence of Theorem 12.4, shallow FN networks $\Sigma^{q_0}(\cos)$ are uniformly dense
on compacta in $\mathcal{C}(\mathbb{R}^{q_0})$.
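
The reduction of a product of cosines to a sum of cosines can be made concrete. Iterating the product-to-sum identity $\cos(a)\cos(b) = \frac{1}{2}(\cos(a+b)+\cos(a-b))$ gives $\prod_{k=1}^{l}\cos(a_k) = 2^{-(l-1)}\sum\cos(a_1\pm a_2\pm\cdots\pm a_l)$, where the sum runs over all $2^{l-1}$ sign patterns. The following sketch checks this numerically; the function name `cos_product_as_sum` and the angles are hypothetical. In the proof, each $a_k$ is the value $A_{j,k}(x)$ of an affine function, and signed sums of affine functions are again affine.

```python
import math
from itertools import product


def cos_product_as_sum(angles):
    """Arguments c_1, ..., c_T (T = 2^(l-1)) with

        prod_k cos(a_k) = (1 / T) * sum_t cos(c_t),

    obtained by iterating cos(a) cos(b) = (cos(a + b) + cos(a - b)) / 2.
    """
    first, rest = angles[0], angles[1:]
    return [first + sum(s * a for s, a in zip(signs, rest))
            for signs in product((1, -1), repeat=len(rest))]


angles = [0.3, -1.2, 2.5]       # stand-ins for A_{j,1}(x), A_{j,2}(x), A_{j,3}(x)
lhs = math.prod(math.cos(a) for a in angles)
terms = cos_product_as_sum(angles)
rhs = sum(math.cos(c) for c in terms) / len(terms)
print(lhs, rhs, len(terms))
```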

The remaining part relies on approximating the cosine activation function.
Firstly, Lemma A.2 of Hornik et al. [192] says that for any continuous squashing
function $\psi$ and any $\epsilon > 0$ there exists $H_\epsilon(x) = \sum_{j=0}^{q_1}\beta_j\,\phi(w_{0,j}+w_{1,j}x)\in\Sigma^1(\phi)$,
$x\in\mathbb{R}$, such that

$$\sup_{x\in\mathbb{R}}|\psi(x)-H_\epsilon(x)| < \epsilon. \tag{12.1}$$


For the proof we refer to Lemma A.2 of Hornik et al. [192]; it uses that $\psi$ is a
continuous squashing function, implying that for every $\delta\in(0,1)$ there exists $m > 0$
such that $\psi(-m) < \delta$ and $\psi(m) > 1-\delta$. Approximation $H_\epsilon\in\Sigma^1(\phi)$ of $\psi$ is then
constructed on $(-m,m)$ so that the error bound holds (and for $\delta$ sufficiently small).
Secondly, choose $\epsilon > 0$ and $M > 0$; there exists $\cos_{M,\epsilon}\in\Sigma^1(\phi)$ such that

$$\sup_{x\in[-M,M]}|\cos(x)-\cos_{M,\epsilon}(x)| < \epsilon. \tag{12.2}$$


This is Lemma A.3 of Hornik et al. [192]; to prove this, we consider the cosine
squasher of Gallant–White [150], for $x\in\mathbb{R}$

$$\chi(x) = \frac{1}{2}\Big(1+\cos\Big(x+\frac{3\pi}{2}\Big)\Big)\mathbb{1}_{\{-\pi/2\le x\le\pi/2\}} + \mathbb{1}_{\{x>\pi/2\}} \in [0,1].$$

This is a continuous squashing function. Adding, subtracting and scaling a finite
number of affinely shifted versions of the cosine squasher $\chi$ can exactly replicate
the cosine on $[-M,M]$. Claim (12.2) then follows from the fact that we need a
finite number of cosine squashers $\chi$ to replicate the cosine on $[-M,M]$, the triangle
inequality, and the fact that the (continuous) cosine squasher can be approximated
arbitrarily well in $\Sigma^1(\phi)$ using (12.1).
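
To make the replication step tangible, consider the special case $M = \pi$. Since $\cos(x+3\pi/2) = \sin(x)$, the cosine squasher equals $(1+\sin x)/2$ on $[-\pi/2,\pi/2]$, and one can verify that $\cos(x) = 2\chi(x+\pi/2) - 2\chi(x-\pi/2) - 1$ for all $x\in[-\pi,\pi]$. The sketch below checks this particular decomposition numerically; it is one possible construction chosen here for illustration and is not taken from Hornik et al. [192]. The function names are hypothetical.

```python
import math


def cosine_squasher(x):
    """Cosine squasher: 0 left of -pi/2, 1 right of pi/2, smooth ramp in between."""
    if x < -math.pi / 2:
        return 0.0
    if x > math.pi / 2:
        return 1.0
    return 0.5 * (1.0 + math.cos(x + 1.5 * math.pi))


def cos_from_squashers(x):
    """Replicates cos(x) on [-pi, pi] from two shifted squashers and a constant."""
    return (2.0 * cosine_squasher(x + math.pi / 2)
            - 2.0 * cosine_squasher(x - math.pi / 2)
            - 1.0)


grid = [-math.pi + k * (2.0 * math.pi) / 1000 for k in range(1001)]
max_err = max(abs(math.cos(x) - cos_from_squashers(x)) for x in grid)
print(max_err)
```

Larger intervals $[-M,M]$ need more shifted copies, but the number of copies stays finite, which is all that the argument requires.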

The final step is to patch everything together. Consider $\sum_{t=1}^T\alpha_t\cos(A_t(x))$
which approximates on the compact set $K\subset\{1\}\times\mathbb{R}^{q_0}$ a given continuous
function $g\in\mathcal{C}(\mathbb{R}^{q_0})$ with a given tolerance $\epsilon/2$. Choose $M > 0$ such that
$A_t(K)\subset[-M,M]$ for all $1\le t\le T$. Note that this $M$ can be found because
$K$ is compact, $A_t$ are continuous and $T$ is finite. Define $T' = T\sum_{t=1}^T|\alpha_t| < \infty$.
By (12.2) we can then choose $\cos_{M,\epsilon/(2T')}\in\Sigma^1(\phi)$ such that

$$\sup_{x\in K}\Big|\sum_{t=1}^T\alpha_t\cos(A_t(x)) - \sum_{t=1}^T\alpha_t\cos_{M,\epsilon/(2T')}(A_t(x))\Big| < \epsilon/2.$$

This completes the proof. □


12.2 Consistency and Asymptotic Normality 


Universality Theorem 12.5 tells us that we can approximate any compactly supported
continuous function arbitrarily well by a sufficiently large shallow FN
network, say, with sigmoid activation function $\phi$. The next natural question is
whether we can learn these approximations from data $(Y_i,x_i)_{i\ge1}$ that follow the true
but unknown regression function $x\mapsto\mu_0(x)$, or in other words whether we have
consistency for a certain class of learning methods. This is the question addressed,
e.g., in White [379, 380], Barron [26], Chen–Shen [73], Döhler–Rüschendorf [109]
and Shen et al. [336]. This turns the algebraic universality question into a statistical
question about consistency.

Assume that the true data model satisfies

$$Y = \mu_0(x)+\varepsilon = \mathbb{E}[Y|x]+\varepsilon, \tag{12.3}$$

for a continuous regression function $\mu_0:\mathcal{X}\to\mathbb{R}$ on a compact set $\mathcal{X}\subset\{1\}\times\mathbb{R}^{q_0}$,
and with a centered error $\varepsilon$ satisfying $\mathbb{E}[|\varepsilon|^{2+\delta}] < \infty$ for some $\delta > 0$ and being
independent of $x$. The question now is whether we can learn this (true) regression
function $\mu_0$ from independent data $(Y_i,x_i)$, $1\le i\le n$, obeying (12.3). Throughout
this section we use the square error loss function $L(y,a) = (y-a)^2$. For given
data, this results in solving

$$\widehat{\mu}_n = \underset{\mu\in\mathcal{C}(\mathcal{X})}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n L(Y_i,\mu(x_i)) = \underset{\mu\in\mathcal{C}(\mathcal{X})}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n (Y_i-\mu(x_i))^2, \tag{12.4}$$

where $\mathcal{C}(\mathcal{X})$ denotes the set of continuous functions on the compact set $\mathcal{X}\subset\{1\}\times\mathbb{R}^{q_0}$.
The main question is whether estimator $\widehat{\mu}_n$ approaches the true regression
function $\mu_0$ for increasing sample size $n$.

Typically, the family of continuous functions $\mathcal{C}(\mathcal{X})$ is much too rich to be able to
solve optimization problem (12.4), and the solution may have undesired properties.
In particular, the solution to (12.4) will over-fit to the data for any sample size
$n$, and consistency will not hold, see, e.g., Section 2.2.1 in Chen [72]. Therefore,
the optimization needs to be done over (well-chosen) smaller sets $\mathcal{S}_n\subset\mathcal{C}(\mathcal{X})$.
For instance, $\mathcal{S}_n$ can be the set of shallow FN networks having a maximal width
$q_1 = q_1(n)$, depending on the sample size $n$ of the data. Considering this regression
problem in a non-parametric sense, we let these sets $\mathcal{S}_n$ grow with the sample size
$n$. This idea is attributed to Grenander [172] and it is called the method of sieve
estimators of $\mu_0$. We define for $d\in\mathbb{N}$, $\Delta > 0$, $\widetilde{\Delta} > 0$ and activation function $\phi$

$$\mathcal{S}(d,\Delta,\widetilde{\Delta},\phi) = \Big\{f\in\Sigma^{q_0}(\phi);\ q_1 = d,\ \sum_{j=0}^{q_1}|\beta_j|\le\Delta,\ \max_{1\le j\le q_1}\sum_{l=0}^{q_0}|w_{l,j}|\le\widetilde{\Delta}\Big\}.$$


These sets $\mathcal{S}(d,\Delta,\widetilde{\Delta},\phi)$ are shallow FN networks of a given width $q_1 = d$ and with
some restrictions on the network parameters.¹ We then choose increasing sequences
$(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$ which provides us with an increasing sequence of
sieves (becoming finer as $n$ increases)

$$\cdots\subset\mathcal{S}_n(\phi) \overset{\text{def.}}{=} \mathcal{S}(d_n,\Delta_n,\widetilde{\Delta}_n,\phi)\subset\mathcal{S}_{n+1}(\phi) \overset{\text{def.}}{=} \mathcal{S}(d_{n+1},\Delta_{n+1},\widetilde{\Delta}_{n+1},\phi)\subset\cdots$$

¹ The bound $\sum_{j=0}^{q_1}|\beta_j|\le\Delta$ in $\mathcal{S}(d,\Delta,\widetilde{\Delta},\phi)$ allows us to view this set of shallow FN networks
as a symmetric convex hull of the family of functions $\mathcal{S}_0(\phi) = \{x\mapsto\phi(A(x));\ A\in\mathcal{A}^{q_0}\}$, see
Sect. 2.6.3 in Van der Vaart–Wellner [364]. If we choose an increasing activation function $\phi$, this
family of functions $\phi\circ A$ is a composition of a fixed increasing function $\phi$ and a finite dimensional
vector space $\mathcal{A}^{q_0}$ of functions $A$. This implies that $\mathcal{S}_0(\phi)$ is a VC-class, saying that it has a finite
Vapnik–Chervonenkis (VC) dimension [365]; see also Condition A and Theorem 2.1 in
Döhler–Rüschendorf [109]. This VC-class is an important property in many proofs as it leads to a finite
covering (metric entropy) of function spaces, and this allows one to apply limit theorems to point
processes, we refer to Van der Vaart–Wellner [364].

The following corollary is a simple consequence of Theorem 12.5. 


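Corollary 12.6 Assume $\phi$ is a squashing activation function, and let the increasing
sequences $(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$ tend to infinity for $n\to\infty$. Then
$\bigcup_{n\ge1}\mathcal{S}_n(\phi)$ is uniformly dense in $\mathcal{C}(\mathcal{X})$.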


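This corollary says that for any regression function $\mu_0\in\mathcal{C}(\mathcal{X})$ we can find $n\in\mathbb{N}$
and $\mu_n\in\mathcal{S}_n(\phi)$ such that $\mu_n$ is arbitrarily close to $\mu_0$; remark that all functions are
continuous on the compact set $\mathcal{X}$, and uniformly dense means $\rho_{\mathcal{X}}$-dense in that case.
Corollary 12.6 does not hold true if $\Delta_n = \Delta > 0$, for all $n$. In that case we can only
approximate the smaller function class $\overline{\bigcup_{n\ge1}\mathcal{S}_n(\phi)}\subset\mathcal{C}(\mathcal{X})$. This is going to be
used in one of the cases, below.

For increasing sequences $(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$ we define the sieve
estimator $(\widehat{\mu}_n)_{n\ge1}$ by

$$\widehat{\mu}_n = \underset{\mu\in\mathcal{S}_n(\phi)}{\arg\min}\ \frac{1}{n}\sum_{i=1}^n L(Y_i,\mu(X_i)). \tag{12.5}$$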


Under the following assumptions one can prove a consistency theorem. 


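Assumption 12.7 Choose a complete probability space $(\Omega,\mathcal{A},\mathbb{P})$² and $\mathcal{X} = \{1\}\times[0,1]^{q_0}$.


(1) Assume $\mu_0\in\mathcal{C}(\mathcal{X})$. Assume $(Y_i,X_i)_{i\ge1}$ are i.i.d. on $(\Omega,\mathcal{A},\mathbb{P})$ following the
regression structure (12.3) with $\varepsilon_i$ being centered, having $\mathbb{E}[|\varepsilon_i|^{2+\delta}] < \infty$ for
some $\delta > 0$ and being independent of $X_i$. Set $\sigma^2 = \mathrm{Var}(\varepsilon_i) < \infty$.

(2) The activation function $\phi$ is the sigmoid function.

(3) The sequences $(d_n)_{n\ge1}$, $(\Delta_n)_{n\ge1}$ and $(\widetilde{\Delta}_n)_{n\ge1}$ are increasing and tending to
infinity as $n\to\infty$ with $d_n\Delta_n^2\log(d_n\Delta_n) = o(n)$.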


Most results that we are going to present below hold for activation functions that 
are Lipschitz. The sigmoid activation function is Lipschitz, see Lemma 12.2. 
The following considerations are based on the pseudo-norm, given $(X_i)_{1\le i\le n}$,

$$\|\mu\|_n = \Big(\frac{1}{n}\sum_{i=1}^n\mu(X_i)^2\Big)^{1/2} \qquad\text{for } \mu\in\mathcal{C}(\mathcal{X}).$$

² A probability space $(\Omega,\mathcal{A},\mathbb{P})$ is complete if for any $\mathbb{P}$-null set $B\in\mathcal{A}$ with $\mathbb{P}[B] = 0$ and every
subset $A\subset B$ it follows that $A\in\mathcal{A}$.

This is a pseudo-norm because it is positive $\|\mu\|_n\ge0$, absolutely homogeneous
$\|a\mu\|_n = |a|\,\|\mu\|_n$, and the triangle inequality holds, but it is not definite because
$\|\mu\|_n = 0$ does not imply that $\mu$ is the zero function (i.e. it is not point-separating).
This pseudo-norm $\|\cdot\|_n$ depends on the (random) features $(X_i)_{1\le i\le n}$ and, therefore,
the subsequent statements involving this pseudo-norm hold in probability. The
following result provides consistency, and that the true regression function $\mu_0$,
indeed, can be learned from i.i.d. data.
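
The lack of definiteness is easy to see in a small example. The sketch below assumes a one-dimensional feature (ignoring the intercept component) observed at five hypothetical points and constructs a non-zero polynomial that vanishes at all of them; the names `pseudo_norm` and `vanishing` are illustrative.

```python
import math


def pseudo_norm(mu, features):
    """Empirical pseudo-norm ||mu||_n = sqrt( (1/n) * sum_i mu(X_i)^2 )."""
    return math.sqrt(sum(mu(x) ** 2 for x in features) / len(features))


# Hypothetical observed features X_1, ..., X_5 in [0, 1].
features = [0.0, 0.25, 0.5, 0.75, 1.0]


def vanishing(x):
    """Non-zero continuous function with roots exactly at the observed features."""
    return x * (x - 0.25) * (x - 0.5) * (x - 0.75) * (x - 1.0)


def identity(x):
    return x


print(pseudo_norm(vanishing, features))   # 0.0 although vanishing is not the zero function
print(vanishing(0.1))                     # non-zero away from the observed features
print(pseudo_norm(identity, features))
```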


Theorem 12.8 (Consistency, Theorem 3.1 of Shen et al. [336]) Under Assumption 12.7,
the sieve estimator $(\widehat{\mu}_n)_{n\ge1}$ in (12.5) exists. We have consistency
$\|\widehat{\mu}_n-\mu_0\|_n\to0$ in probability as $n\to\infty$, i.e., for all $\epsilon > 0$

$$\lim_{n\to\infty}\mathbb{P}\big[\|\widehat{\mu}_n-\mu_0\|_n > \epsilon\big] = 0.$$


Remarks 12.9 


• Such a consistency result for FN networks has first been proved in Theorem 3.3
of White [380], however, on slightly different spaces and under slightly different
assumptions. Similar consistency results have been obtained for related point
process situations by Döhler–Rüschendorf [109] and for time-series in White
[380] and Chen–Shen [73].

• Item (3) of Assumption 12.7 gives upper complexity bounds on shallow FN
networks as a function of the sample size $n$ of the data, so that asymptotically
they do not over-fit to the data. These bounds allow for much freedom in the
choice of the growth rates, and different choices may lead to different speeds of
convergence. The conditions of Assumption 12.7 are, e.g., satisfied for $\Delta_n = O(\log n)$
and $d_n = O(n^{1-\delta'})$, for any small $\delta' > 0$. Under these choices, the
complexity $d_n$ of the shallow FN network grows rather quickly. Table 1 of White
[380] gives some examples, for instance, if for $n = 100$ data points we have a
shallow FN network with 5 neurons, then these magnitudes support 477 neurons
for $n = 10’000$ and 45’600 neurons for $n = 1’000’000$ data points (for the
specific choice $\delta' = 0.01$). Of course, these numbers do not provide any practical
guidance on the selection of the (shallow) FN network size.

• Theorem 12.8 requires that we can explicitly calculate the sieve estimator
$\widehat{\mu}_n$, i.e., the global minimizer of the objective function in (12.5). In practical
applications, relying on gradient descent algorithms, typically, this is not the case.
Therefore, Theorem 12.8 is mainly of theoretical value saying that learning the
true regression function $\mu_0$ is possible within FN networks.
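
The neuron counts quoted in the second remark can be reproduced with a few lines. The sketch assumes that the growth rate is exactly $d_n = c\,n^{1-\delta'}$ with the constant $c$ calibrated through the reference point of 5 neurons at $n = 100$; this calibration and the function name `supported_neurons` are assumptions made for illustration, they are not taken from White [380].

```python
import math


def supported_neurons(n, n_ref=100, d_ref=5, delta_prime=0.01):
    """Width d_n = d_ref * (n / n_ref)^(1 - delta') implied by d_n proportional to n^(1 - delta')."""
    return d_ref * (n / n_ref) ** (1.0 - delta_prime)


d_10k = math.floor(supported_neurons(10_000))       # 477
d_1m = math.floor(supported_neurons(1_000_000))     # 45600
print(d_10k, d_1m)
```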


Sketch of Proof of Theorem 12.8 The proof of this theorem is based on a theorem
in White–Wooldridge [381] which states the following: if we have a sequence
$(\mathcal{S}_n(\phi))_{n\ge1}$ of compact subsets of $\mathcal{C}(\mathcal{X})$, and if
$L_n:\Omega\times\mathcal{S}_n(\phi)\to\mathbb{R}$ is an $\mathcal{A}\otimes\mathcal{B}(\mathcal{S}_n(\phi))/\mathcal{B}(\mathbb{R})$-measurable
sequence, $n\ge1$, with $L_n(\omega,\cdot)$ being lower-semicontinuous on $\mathcal{S}_n(\phi)$
for all $\omega\in\Omega$, then there exists $\widehat{\mu}_n:\Omega\to\mathcal{S}_n(\phi)$ being
$\mathcal{A}/\mathcal{B}(\mathcal{S}_n(\phi))$-measurable such that for each $\omega\in\Omega$,
$L_n(\omega,\widehat{\mu}_n(\omega)) = \inf_{\mu\in\mathcal{S}_n(\phi)}L_n(\omega,\mu)$. For the proof of the
compactness of $\mathcal{S}_n(\phi)$ in $\mathcal{C}(\mathcal{X})$ we need that $d_n$ and $\Delta_n$ are finite for any $n$. This
then provides the existence of the sieve estimator, for details we refer to Lemma 2.1
and Corollary 2.1 in Shen et al. [336]. The proof of the consistency result then uses
the growth rates on $(d_n)_{n\ge1}$ and $(\Delta_n)_{n\ge1}$, for the details of the proof we refer to
Theorem 3.1 in Shen et al. [336]. □


The next step is to analyze the rates of convergence of the sieve estimator
$\widehat{\mu}_n\to\mu_0$, as $n\to\infty$. These rates heavily depend on (additional) regularity
assumptions on the true regression function $\mu_0\in\mathcal{C}(\mathcal{X})$; we refer to Remark 3
in Sect. 5 of Chen–Shen [73]. Here, we present some results of Shen et al. [336].
From the proof of Theorem 12.8 we know that $\mathcal{S}_n(\phi)$ is a compact set in $\mathcal{C}(\mathcal{X})$. This
motivates to consider the closest approximation $\pi_n\mu\in\mathcal{S}_n(\phi)$ to $\mu\in\mathcal{C}(\mathcal{X})$. The
uniform denseness of $\bigcup_n\mathcal{S}_n(\phi)$ in $\mathcal{C}(\mathcal{X})$ implies that $\pi_n\mu$ converges to $\mu$. The
aforementioned rates of convergence of the sieve estimators will depend on how fast
$\pi_n\mu_0\in\mathcal{S}_n(\phi)$ converges to the true regression function $\mu_0\in\mathcal{C}(\mathcal{X})$.

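If one cannot determine the global minimum of (12.5), then often an accurate
approximation is sufficient. For this one introduces an approximate sieve estimator.
A sequence $(\widehat{\mu}_n)_{n\ge1}$ is called an approximate sieve estimator if

$$\frac{1}{n}\sum_{i=1}^n(Y_i-\widehat{\mu}_n(X_i))^2 \le \inf_{\mu\in\mathcal{S}_n(\phi)}\frac{1}{n}\sum_{i=1}^n(Y_i-\mu(X_i))^2 + O_P(\eta_n), \tag{12.6}$$

where $(\eta_n)_{n\ge1}$ is a positive sequence converging to 0 as $n\to\infty$. The last term
$O_P(\eta_n)$ denotes stochastic boundedness, meaning that for all $\epsilon > 0$ there exists $K_\epsilon > 0$
such that for all $n\ge1$

$$\mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n(Y_i-\widehat{\mu}_n(X_i))^2 - \inf_{\mu\in\mathcal{S}_n(\phi)}\frac{1}{n}\sum_{i=1}^n(Y_i-\mu(X_i))^2 > K_\epsilon\,\eta_n\Big] < \epsilon.$$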

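Theorem 12.10 (Theorem 4.1 of Shen et al. [336], Without Proof) Let Assumption 12.7
hold. If

$$\eta_n = O\Big(\min\Big\{\|\pi_n\mu_0-\mu_0\|_n^2,\ \frac{d_n\log(d_n\Delta_n)}{n},\ \frac{d_n\log n}{n}\Big\}\Big),$$

the following stochastic boundedness holds for $n\ge1$

$$\|\widehat{\mu}_n-\mu_0\|_n = O_P\Big(\max\Big\{\|\pi_n\mu_0-\mu_0\|_n,\ \sqrt{\frac{d_n\log n}{n}}\Big\}\Big).$$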

Remarks 12.11 


• Assumption 12.7 implies that $d_n\log(d_n\Delta_n) = o(n)$ as $n\to\infty$. Therefore, $\eta_n\to0$
as $n\to\infty$.

• The statement in Theorem 4.1 of Shen et al. [336] is more involved because it
is stated under slightly different assumptions. Our assumptions are sufficient for
having consistency of the sieve estimator, see Theorem 12.8, and making these
assumptions implies that the rate of convergence in Theorem 12.10 is determined
by the rate of convergence of $\|\pi_n\mu_0-\mu_0\|_n$ and $(n^{-1}d_n\log n)^{1/2}$, see Remark 4.1
in Shen et al. [336].

• The rate of convergence in Theorem 12.10 crucially depends on the rate
$\|\pi_n\mu_0-\mu_0\|_n$, as $n\to\infty$. If $\mu_0$ lies in the (sub-)space of functions with
finite first absolute moments of the Fourier magnitude distributions, denoted by
$\mathcal{F}(\mathcal{X})\subset\mathcal{C}(\mathcal{X})$, Makovoz [262] has shown that $\|\pi_n\mu_0-\mu_0\|_n$ decays at least as
$d_n^{-(q_0+1)/(2q_0)} = d_n^{-1/2-1/(2q_0)}$; this has improved the rate of $d_n^{-1/2}$ obtained by
Barron [25]. This space $\mathcal{F}(\mathcal{X})$ allows for the choices $d_n = (n/\log n)^{q_0/(1+2q_0)}$,
$\Delta_n = \Delta > 0$ and $\widetilde{\Delta}_n = \widetilde{\Delta} > 0$ to receive consistency and the following rate of
convergence, see Chen–Shen [73] and Remark 4.1 in Shen et al. [336],

$$\|\widehat{\mu}_n-\mu_0\|_n = O_P\big(r_n^{-1}\big),$$

for

$$r_n = \Big(\frac{n}{\log n}\Big)^{(q_0+1)/(4q_0+2)}, \qquad n\ge2. \tag{12.7}$$

Note that $1/4\le(q_0+1)/(4q_0+2)\le1/2$. Thus, this is a slower rate than the
square root rule of typical asymptotic normality, for instance, for $q_0 = 1$ we get
$1/3$. Interestingly, Barron [26] proposes the choice $d_n\sim(n/\log n)^{1/2}$ to receive
an approximation rate of $(n/\log n)^{-1/4}$.

Also note that the space $\mathcal{F}(\mathcal{X})$ allows us to choose a finite $\Delta_n = \Delta > 0$
in the sieves, thus, here we do not receive denseness of the sieves in the space
of continuous functions $\mathcal{C}(\mathcal{X})$, but only in the space of functions with finite first
absolute moments of the Fourier magnitude distributions $\mathcal{F}(\mathcal{X})$.
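
The exponent in (12.7) can be understood as the result of balancing the two error terms of Theorem 12.10. The sketch below assumes a width of the form $d_n = (n/\log n)^a$, an approximation error of order $d_n^{-(q_0+1)/(2q_0)}$ and an estimation error of order $(d_n\log n/n)^{1/2}$, and verifies with exact rational arithmetic that $a = q_0/(2q_0+1)$ equalizes both exponents at $(q_0+1)/(4q_0+2)$. This is a consistency check under these stated assumptions, not a derivation taken from the cited references; all function names are hypothetical.

```python
from fractions import Fraction


def rate_exponent(q0):
    """Exponent of n / log(n) in the rate r_n of (12.7)."""
    return Fraction(q0 + 1, 4 * q0 + 2)


def balanced_width_exponent(q0):
    """Exponent a in d_n = (n / log n)^a that balances both error terms."""
    return Fraction(q0, 2 * q0 + 1)


def check_balance(q0):
    a = balanced_width_exponent(q0)
    approx = a * Fraction(q0 + 1, 2 * q0)   # d_n^{-(q0+1)/(2 q0)} = (n/log n)^{-approx}
    estim = (1 - a) / 2                     # (d_n log n / n)^{1/2} = (n/log n)^{-estim}
    return approx == estim == rate_exponent(q0)


for q0 in (1, 2, 5, 20):
    print(q0, balanced_width_exponent(q0), rate_exponent(q0), check_balance(q0))
```

The output illustrates that the exponent decreases from $1/3$ for $q_0 = 1$ towards $1/4$ as the input dimension grows.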


The last step is to establish the asymptotic normality. For this we have to define
perturbations of shallow FN networks $\mu\in\mathcal{S}_n(\phi)$. Choose $\eta_n\in(0,1)$ and define
the function

$$\widetilde{\mu}_n(\mu) = \big(1-\eta_n^{1/2}\big)\mu + \eta_n^{1/2}(\mu_0+1).$$


This allows us to state the following asymptotic normality result. 


Theorem 12.12 (Theorem 5.1 of Shen et al. [336], Without Proof) Let Assumption 12.7
hold. We make the following additional assumptions: suppose $\eta_n = o(n^{-1})$ and
choose $\rho_n$ such that we have stochastic boundedness $\rho_n\|\widehat{\mu}_n-\mu_0\|_n = O_P(1)$. Let
the following conditions hold:


(C1) $d_n\Delta_n\log(d_n\Delta_n) = o(n^{1/4})$;
(C2) $n\rho_n^{-2}/\Delta_n = o(1)$;




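(C3) $\sup_{\mu\in\mathcal{S}_n(\phi):\,\|\mu-\mu_0\|_n\le\rho_n^{-1}}\ \|\pi_n\widetilde{\mu}_n(\mu)-\widetilde{\mu}_n(\mu)\|_n = O_P(\rho_n\eta_n)$;

(C4) $\sup_{\mu\in\mathcal{S}_n(\phi):\,\|\mu-\mu_0\|_n\le\rho_n^{-1}}\ \frac{1}{n}\sum_{i=1}^n\varepsilon_i\big(\pi_n\widetilde{\mu}_n(\mu)(X_i)-\widetilde{\mu}_n(\mu)(X_i)\big) = O_P(\eta_n)$.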


We have the following asymptotic normality for $n\to\infty$

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n\big(\widehat{\mu}_n(X_i)-\mu_0(X_i)\big) \Rightarrow \mathcal{N}(0,\sigma^2).$$


The assumptions of Theorem 12.12 require a slower growth rate $d_n$ on the
shallow FN network compared to the consistency results. Shen et al. [336] bring
forward the argument that for the asymptotic normality result to hold, the shallow
FN network should grow slower in order to get the Gaussian property, otherwise the
sieve estimator may skew towards the true function $\mu_0$. Conditions (C3)–(C4) on
the other hand give lower growth rates on the networks such that the approximation
error decreases sufficiently fast.

If the variance parameter $\sigma^2 = \mathrm{Var}(\varepsilon_i)$ is not known, we can empirically estimate
it

$$\widehat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n\big(Y_i-\widehat{\mu}_n(X_i)\big)^2.$$

Theorem 5.2 in Shen et al. [336] proves that this estimator is consistent for
$\sigma^2$, and the asymptotic normality result also holds true under this estimated
variance parameter (using Slutsky’s theorem), and under the same assumptions as
in Theorem 12.12.


12.3 Functional Limit Theorem 


Horel–Giesecke [190] push the above asymptotic results even one step further. Note
that the asymptotic normality of Theorem 12.12 is not directly useful for variable
selection, since the asymptotic result integrates over the feature space $\mathcal{X}$.
Horel–Giesecke [190] prove a functional limit theorem which we briefly review in this
section.

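A $q_0$-tuple $\alpha = (\alpha_1,\ldots,\alpha_{q_0})^\top\in\mathbb{N}_0^{q_0}$ is called a multi-index, and we set
$|\alpha| = \alpha_1+\ldots+\alpha_{q_0}$. Define the derivative operator

$$\nabla^\alpha = \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1}\cdots\partial x_{q_0}^{\alpha_{q_0}}}.$$

Consider the compact feature space $\mathcal{X} = \{1\}\times[0,1]^{q_0}$ with $q_0\ge3$. Choose a
distribution $\nu$ on this feature space $\mathcal{X}$ and define the $L^2$-space

$$L^2(\mathcal{X},\nu) = \Big\{\mu:\mathcal{X}\to\mathbb{R}\ \text{measurable};\ \mathbb{E}_\nu\big[\mu(X)^2\big] = \int_{\mathcal{X}}\mu(x)^2\,d\nu(x) < \infty\Big\}.$$

Next, define the Sobolev space for $k\in\mathbb{N}$

$$W^{k,2}(\mathcal{X},\nu) = \Big\{\mu\in L^2(\mathcal{X},\nu);\ \nabla^\alpha\mu\in L^2(\mathcal{X},\nu)\ \text{for all } \alpha\in\mathbb{N}_0^{q_0} \text{ with } |\alpha|\le k\Big\},$$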


where $\nabla^\alpha\mu$ is the weak derivative of $\mu$. The motivation for studying Sobolev
spaces is that for sufficiently large $k$ and the existence of weak derivatives
$\nabla^\alpha\mu\in L^2(\mathcal{X},\nu)$, $|\alpha|\le k$, we eventually receive a classical derivative of $\mu$, see below. We
define the Sobolev norm for $\mu\in W^{k,2}(\mathcal{X},\nu)$ by

$$\|\mu\|_{k,2} = \Big(\sum_{|\alpha|\le k}\mathbb{E}_\nu\big[(\nabla^\alpha\mu(X))^2\big]\Big)^{1/2}.$$

The normed Sobolev space $(W^{k,2}(\mathcal{X},\nu),\|\cdot\|_{k,2})$ is a Hilbert space. Since we would
like to consider gradient-based methods, we consider the following space

$$\mathcal{C}_B^1(\mathcal{X},\nu) = \Big\{\mu:\mathcal{X}\to\mathbb{R}\ \text{continuously differentiable};\ \|\mu\|_{\lfloor q_0/2\rfloor+2,2}\le B\Big\}, \tag{12.8}$$


for some positive constant $B < \infty$. We will assume that the true regression function
$\mu_0\in\mathcal{C}_B^1(\mathcal{X},\nu)$, thus, the true regression function has a bounded Sobolev norm
$\|\cdot\|_{\lfloor q_0/2\rfloor+2,2}$ of maximal size $B$. Assume that $\mathcal{X}^\circ\subset\mathbb{R}^{q_0}$ is the open interior of $\mathcal{X}$
(excluding the intercept component), and that $\nu$ is absolutely continuous w.r.t. the
Lebesgue measure with a strictly positive and bounded density on $\mathcal{X}$ (excluding
the intercept component). The Sobolev number of the space $W^{\lfloor q_0/2\rfloor+2,2}(\mathcal{X}^\circ,\nu)$ is
given by $m = \lfloor q_0/2\rfloor+2-q_0/2\ge1.5 > 1$. The Sobolev embedding theorem
then tells us that for any function $\mu\in W^{\lfloor q_0/2\rfloor+2,2}(\mathcal{X}^\circ,\nu)$, there exists an
$\lfloor m\rfloor$-times continuously differentiable function on $\mathcal{X}^\circ$ that is equal to $\mu$ a.e., thus, the
class of equivalent functions $\mu\in W^{\lfloor q_0/2\rfloor+2,2}(\mathcal{X}^\circ,\nu)$ has a representative in $\mathcal{C}^1(\mathcal{X}^\circ)$,
$\lfloor m\rfloor\ge1$; this motivates the consideration of the space in (12.8).
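
The Sobolev number depends on the input dimension only through its parity. The following small sketch evaluates $m = \lfloor q_0/2\rfloor+2-q_0/2$ exactly for several dimensions; the function name `sobolev_number` is an illustrative choice.

```python
from fractions import Fraction


def sobolev_number(q0):
    """m = floor(q0 / 2) + 2 - q0 / 2 for the space W^{floor(q0/2)+2, 2}."""
    k = q0 // 2 + 2                  # order of weak derivatives required in (12.8)
    return Fraction(k) - Fraction(q0, 2)


values = {q0: sobolev_number(q0) for q0 in range(3, 9)}
print(values)   # m = 3/2 for odd q0 and m = 2 for even q0
```

In both cases $m > 1$, which is what the embedding into the continuously differentiable functions requires.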

In practice, the bound $B$ needs a careful consideration because the true $\mu_0$ is
unknown. Therefore, $B$ should be sufficiently large so that $\mu_0$ is contained in the
space $\mathcal{C}_B^1(\mathcal{X},\nu)$ and, on the other hand, it should not be too large as this will weaken
the power of the tests, below.

We choose the sigmoid activation function for $\phi$ and we consider the approximate
sieve estimators $(\widehat{\mu}_n)_{n\ge1}$ for given data $(Y_i,X_i)_i$ obtained by a solution to

$$\frac{1}{n}\sum_{i=1}^n(Y_i-\widehat{\mu}_n(X_i))^2 \le \inf_{\mu\in\mathcal{S}_n(\phi)}\frac{1}{n}\sum_{i=1}^n(Y_i-\mu(X_i))^2 + o_P(1), \tag{12.9}$$

where we allow for an error term $o_P(1)$ that converges in probability to zero as
$n\to\infty$. In contrast to (12.6) we do not specify the error rate, here.


Assumption 12.13 Choose a complete probability space $(\Omega,\mathcal{A},\mathbb{P})$ and $\mathcal{X} = \{1\}\times[0,1]^{q_0}$.


(1) Assume $\mu_0\in\mathcal{C}_B^1(\mathcal{X},\nu)$ for some $B > 0$, and $(Y_i,X_i)_{i\ge1}$ are i.i.d. on
$(\Omega,\mathcal{A},\mathbb{P})$ following regression structure (12.3) with $\varepsilon_i$ being centered, having
$\mathbb{E}[|\varepsilon_i|^{2+\delta}] < \infty$ for some $\delta > 0$, being absolutely continuous w.r.t. the Lebesgue
measure, and being independent of $X_i$; the features $X_i\sim\nu$ are absolutely
continuous w.r.t. the Lebesgue measure having a bounded and strictly positive
density on $\mathcal{X}$ (excluding the intercept component). Set $\sigma^2 = \mathrm{Var}(\varepsilon_i) < \infty$.

(2) The activation function $\phi$ is the sigmoid function.

(3) The sequence $(d_n)_{n\ge1}$ is increasing and going to infinity satisfying
$d_n^{2+1/q_0}\log(d_n) = O(n)$ as $n\to\infty$, and $\Delta_n = \Delta > 0$, $\widetilde{\Delta}_n = \widetilde{\Delta} > 0$
for $n\ge1$.

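(4) Define $L_\mu(X,\varepsilon) = -2\varepsilon(\mu(X)-\mu_0(X)) + (\mu(X)-\mu_0(X))^2$, and it holds for
$n\ge2$

$$\frac{1}{n}\sum_{i=1}^n\Big(L_{\widehat{\mu}_n}(X_i,\varepsilon_i)-\mathbb{E}_\nu\big[L_{\widehat{\mu}_n}(X_1,\varepsilon_1)\big]\Big) \le \inf_{h\in\mathcal{C}_B^1(\mathcal{X},\nu)}\frac{1}{n}\sum_{i=1}^n\Big(L_{\mu_0+h/r_n}(X_i,\varepsilon_i)-\mathbb{E}_\nu\big[L_{\mu_0+h/r_n}(X_1,\varepsilon_1)\big]\Big) + o_P\big(r_n^{-2}\big),$$

for $r_n$ being the rate defined in (12.7).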


The first three items of this assumption are rather similar to Assumption 12.7
which provides consistency in Theorem 12.8 and the rates of convergence in
Theorem 12.10. Item (4) of Assumption 12.13 needs to be compared to (C3)–(C4)
of Theorem 12.12 which is used for getting the asymptotic normality. $(r_n)_n$
is the rate that provides convergence in probability of the sieve estimator to the true
regression function, and this magnitude is used for the perturbation, see also (C3)–(C4)
in Theorem 12.12.


Theorem 12.14 (Asymptotics, Theorem 1 of Horel–Giesecke [190], Without
Proof) Under Assumption 12.13 the approximate sieve estimator $(\widehat{\mu}_n)_{n\ge1}$ of (12.9)
converges weakly in the metric space $(\mathcal{C}_B^1(\mathcal{X},\nu), d_\nu)$ with
$d_\nu(\mu,\mu') = \mathbb{E}_\nu[(\mu(X)-\mu'(X))^2]$:

$$r_n\big(\widehat{\mu}_n-\mu_0\big) \Rightarrow \mu^\star \qquad\text{as } n\to\infty,$$

where $\mu^\star$ is the arg max of the Gaussian process $\{G_\mu;\ \mu\in\mathcal{C}_B^1(\mathcal{X},\nu)\}$ with mean
zero and covariance function $\mathrm{Cov}(G_\mu,G_{\mu'}) = 4\sigma^2\,\mathbb{E}_\nu[\mu(X)\mu'(X)]$.


Remarks 12.15 We highlight the differences between Theorems 12.12 and 12.14.


• Theorem 12.12 provides a convergence in distribution to a Gaussian random
variable, whereas the limit in Theorem 12.14 is a random function $x \mapsto \mu^*(x) = \mu^*_\omega(x)$,
$\omega\in\Omega$; thus, the former convergence result integrates over the (empirical)
feature distribution, whereas the latter also allows for a point-wise consideration
in feature $x$.

• The former theorem does not allow for variable selection in $X$ whereas the latter
does because the limiting function still discriminates different feature values.

• For the proof of Theorem 12.14 we refer to Horel–Giesecke [190]. It is based
on asymptotic results on empirical point processes; we refer to Van der Vaart–Wellner
[364]. The Gaussian process $\{G_\mu;\ \mu \in C_B(\mathcal{X}, \nu)\}$ is parametrized by the
(totally bounded) space $C_B(\mathcal{X}, \nu)$, and it is continuous over this compact index
space. This implies that it takes its maximum. Uniqueness of the maximum then
gives us the random function $\mu^*$ which exactly describes the limiting distribution
of $r_n(\hat\mu_n - \mu_0)$ as $n\to\infty$.


12.4 Hypothesis Testing 


Theorem 12.14 can be used to provide a significance test for feature component
selection, similarly to the LRT and the Wald test presented in Sect. 5.3.2 on GLMs.
We define gradient-based test statistics, for $1\le j\le q_0$, and w.r.t. the approximate
sieve estimator $\hat\mu_n \in \mathcal{S}_n(\phi)$ given in (12.9),

$$\Lambda_j^{(n)} = \int_{\mathcal{X}} \left(\frac{\partial\hat\mu_n(x)}{\partial x_j}\right)^2 d\nu(x) \qquad \text{and} \qquad \hat\Lambda_j^{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial\hat\mu_n(X_i)}{\partial x_j}\right)^2.$$

The test statistic $\Lambda_j^{(n)}$ integrates the squared partial derivative of the sieve estimator
$\hat\mu_n$ w.r.t. the distribution $\nu$, whereas $\hat\Lambda_j^{(n)}$ can be considered as its empirical
counterpart if $X \sim \nu$. Note that both test statistics depend on the data $(Y_i, X_i)_{1\le i\le n}$
determining the sieve estimator $\hat\mu_n$, see (12.9). These test statistics are used to test
the following null hypothesis $H_0$ against the alternative hypothesis $H_1$ for the true
regression function $\mu_0 \in C_B(\mathcal{X}, \nu)$

$$H_0:\ \lambda_j = \mathbb{E}_\nu\left[\left(\frac{\partial\mu_0(X)}{\partial x_j}\right)^2\right] = 0 \qquad \text{against} \qquad H_1:\ \lambda_j \neq 0. \qquad (12.10)$$
We emphasize that the expression $\lambda_j$ in (12.10) is a deterministic number; for this
reason we use the expected value notation $\mathbb{E}_\nu[\cdot]$. This is in contrast to $\Lambda_j^{(n)}$, which is
only a conditional expectation, conditionally given the data $(Y_i, X_i)_{1\le i\le n}$.


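Proposition 12.16 (Theorem 2 and Proposition 3 of Horel–Giesecke [190],
Without Proof) Under Assumption 12.13 and under the null hypothesis $H_0$ we
have for $n\to\infty$

$$r_n^2\,\Lambda_j^{(n)} \Rightarrow \Psi_j \quad \text{and} \quad r_n^2\,\hat\Lambda_j^{(n)} \Rightarrow \Psi_j, \qquad \text{with } \Psi_j = \int_{\mathcal{X}} \left(\frac{\partial\mu^*(x)}{\partial x_j}\right)^2 d\nu(x). \qquad (12.11)$$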


In order to use this proposition we need to be able to calculate the limiting
distribution characterized by the random variable $\Psi_j$. The maximal argument $\mu^*$ of
the Gaussian process $\{G_\mu;\ \mu \in C_B(\mathcal{X}, \nu)\}$ is given by a random function such that
for all $\omega\in\Omega$, $\mu^*_\omega(\cdot)$ fulfills

$$G_{\mu^*_\omega}(\omega) \ge G_\mu(\omega) \qquad \text{for all } \mu \in C_B(\mathcal{X}, \nu).$$

A discretization and simulation approach can be explored to approximate this
maximal argument $\mu^*$ for different $\omega\in\Omega$, see Section 5.7 in Horel–Giesecke [190].


1. Sample random functions $f_k$ from $C_B(\mathcal{X}, \nu)$, $k\ge 1$. The universality
theorems suggest that we sample these random functions $f_k$ from the sieves
$(\mathcal{S}_n(\phi) \cap C_B(\mathcal{X}, \nu))_{n\ge 1}$. This requires sampling the dimension $q_1$ of the shallow FN
network and the corresponding network weights. This provides us with candidate
functions $f_1, \ldots, f_K \in C_B(\mathcal{X}, \nu)$; these candidate functions can be understood
as a random covering of the (totally bounded) index space $C_B(\mathcal{X}, \nu)$.

2. Simulate K-dimensional multivariate Gaussian random variables $G^{(t)}$ (i.i.d.)
with mean zero and (empirical) covariance matrix

$$\hat\Sigma = \left(\frac{4\sigma^2}{n}\sum_{i=1}^n f_k(X_i)\,f_l(X_i)\right)_{1\le k,l\le K}.$$

These random variables $G^{(1)}, \ldots, G^{(T)}$ play the role of discretized random
samples of the Gaussian process $\{G_\mu;\ \mu \in C_B(\mathcal{X}, \nu)\}$.
3. The empirical arg max of the sample $G^{(t)}$, $1\le t\le T$, is obtained by

$$k_t^* = \underset{1\le k\le K}{\arg\max}\ G_k^{(t)},$$

where $G_k^{(t)}$ is the k-th component of $G^{(t)}$.

4. The empirical distribution of the following sample $\Psi_j^{(t)}$, $1\le t\le T$, gives us an
approximation to the limiting distribution in Proposition 12.16

$$\Psi_j^{(t)} = \int_{\mathcal{X}} \left(\frac{\partial f_{k_t^*}(x)}{\partial x_j}\right)^2 d\nu(x),$$

i.e., under the null hypothesis $H_0$ we approximate the right-hand side of (12.11)
by the empirical distribution of $(\Psi_j^{(t)})_{1\le t\le T}$.
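
The four simulation steps can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the implementation of Horel–Giesecke [190]: the features are simulated uniformly on the unit square, the noise variance `sigma2` is assumed known, the network weights are drawn from standard Gaussians, and the integral w.r.t. $\nu$ in step 4 is replaced by the empirical average over the sample. All names (`nets`, `net_eval`, `Psi`, etc.) are hypothetical, and the Sobolev norm bound $B$ is not enforced on the sampled networks.

```r
set.seed(1)
n <- 200; q0 <- 2; q1 <- 3       # sample size, feature dimension, hidden neurons (assumed)
K <- 50; n_sim <- 1000; j <- 1   # candidate functions, Gaussian draws, tested component
sigma2 <- 1                      # noise variance sigma^2, assumed known here
X <- matrix(runif(n * q0), ncol = q0)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Step 1: sample K random shallow FN networks (no Sobolev norm control!)
nets <- lapply(seq_len(K), function(k) list(
  b = rnorm(q1), W = matrix(rnorm(q1 * q0), q1, q0), beta = rnorm(q1)))

# values f(X_i) and partial derivatives d f(X_i) / d x_j of one network
net_eval <- function(net, X, j) {
  Z <- sigmoid(sweep(X %*% t(net$W), 2, net$b, "+"))       # n x q1 hidden activations
  list(f  = as.vector(Z %*% net$beta),
       df = as.vector((Z * (1 - Z)) %*% (net$beta * net$W[, j])))
}
ev    <- lapply(nets, net_eval, X = X, j = j)
Fmat  <- sapply(ev, function(e) e$f)    # n x K matrix of f_k(X_i)
dFmat <- sapply(ev, function(e) e$df)   # n x K matrix of partial derivatives

# Step 2: empirical covariance matrix and Gaussian draws G^(1), ..., G^(T);
# G = 2 sigma / sqrt(n) * t(Fmat) %*% xi with xi ~ N(0, I_n) has covariance Sigma
Sigma <- 4 * sigma2 * crossprod(Fmat) / n
Xi    <- matrix(rnorm(n_sim * n), n_sim, n)
G     <- 2 * sqrt(sigma2 / n) * (Xi %*% Fmat)   # n_sim x K, one draw per row

# Step 3: index of the empirical arg max of each draw
k_star <- max.col(G, ties.method = "first")

# Step 4: sample Psi_j^(t), integral replaced by the empirical average
psi_k <- colMeans(dFmat^2)
Psi   <- psi_k[k_star]
q95   <- quantile(Psi, 0.95)   # simulated 95% quantile of the limiting distribution
```

The draws in step 2 are generated as linear maps of standard Gaussian vectors, which avoids a Cholesky factorization of a possibly ill-conditioned matrix `Sigma`. The resulting quantile is only as good as the random covering by the sampled networks.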


We close this section with some remarks.
Remarks 12.17 


• The quality of the empirical approximation $(\Psi_j^{(t)})_{1\le t\le T}$ to the limiting distribution
of $\Psi_j$ will depend on how well we cover the index set $C_B(\mathcal{X}, \nu)$. We could
try to use covering theorems to control the accuracy. However, this is often too
challenging. The simulation approach presented above suffers from not giving
us any control on the quality of this covering, nor is it clear how the Sobolev
norm condition for $B$ in (12.8) can efficiently be checked during the simulation
approach. We highlight that this Sobolev norm bound $\|f_k\|_{\lfloor q_0/2\rfloor+2,2} \le B$ is
crucial when we want to empirically estimate the distribution of $\Psi_j$; under
special assumptions Horel–Giesecke [190] prove in their Theorem 4 that $\Psi_j$
scales as $B^2$. Thus, if we do not have any control over the Sobolev norm of the
sampled shallow FN networks $f_k$, the above simulation algorithm is not useful to
approximate the limiting distribution in Proposition 12.16.

• The assumptions of Proposition 12.16 require that $X \sim \nu$ has a strictly positive
density over the entire feature space $\mathcal{X}$ (excluding the intercept component). This
is necessary to be able to capture any non-zero partial derivative $\partial\mu_0(x)/\partial x_j$ over
the entire feature space $\mathcal{X}$. In practical applications, where we rely on a finite
sample $(X_i)_{1\le i\le n}$, this may be problematic and needs some care. For instance,
there may be the situation where the samples cluster in two disjoint regions, say
$C_1 \subset \mathcal{X}$ and $C_2 \subset \mathcal{X}$, because we may have $\nu(C_1 \cup C_2) \approx 1$. That is, in that
case we rarely have observations $X_i$ not lying in one of these two clusters. If
$\partial\mu_0(x)/\partial x_j = 0$ on these two clusters $x \in C_1 \cup C_2$, but if $\mu_0$ has a very steep
slope between the two clusters (i.e., if they are really different in terms of $\mu_0$),
then the test on this finite sample will not find the significant slope.

• The distribution $X \sim \nu$ of the features is assumed to be absolutely continuous on
the hypercube $[0, 1]^{q_0}$; this is not fulfilled for binary and categorical features.

• Another question is how the test of Proposition 12.16 is affected by collinearity in
feature components. Note that we only test one component at a time. Moreover,
we would like to highlight the j-dependency in the limiting random variable $\Psi_j$.
This dependency is induced by the properties of the feature distribution $\nu$ that
may not be exchangeable in the components of $x$.
This appendix presents and describes the data sets used.


13.1 French Motor Third Party Liability Data 


We consider a French motor third party liability (MTPL) claims data set. This data
set is available through the R library CASdatasets,¹ which is hosted by Dutang–Charpentier
[113]. The specific data sets chosen from CASdatasets are called
FreMTPL2freq and FreMTPL2sev; the former contains the insurance policy
and claim frequency information and the latter the corresponding claim severity
information.²

Before we can work with this data set we perform data cleaning. It has been
pointed out by Loser [259] that the claim counts on the insurance policies with
policy IDs ≤ 24500 in FreMTPL2freq do not seem to be correct because these
claims do not have claim severity counterparts in FreMTPL2sev. For this reason
we work with the claim counts extracted from the latter file. In Listing 13.1 we give
the code used for data cleaning.³ In this code we merge FreMTPL2freq with the
aggregated severities on each insurance policy, and the corresponding claim counts
are received from FreMTPL2sev; this is done on lines 2-11 of Listing 13.1. A


1 CASdatasets website: http://cas.uqam.ca/.

2 We use CASdatasets version 1.0-8 which has been packaged on 2018-05-20. This version
uses for the 22 French regions the labels R11, ..., R94. In later versions of CASdatasets these
labels have been replaced by the region names; in this transformation the labels R31 (Nord-Pas-de-Calais)
and R41 (Lorraine) have been merged to one region called Nord-Pas-de-Calais. We
believe that this is an error and therefore prefer to work with an older version of CASdatasets.
This older version can be downloaded in R with library(OpenML), library(farff),
freMTPL2freq <- getOMLDataSet(data.id = 41214)$data


3 The code in Listing 13.1 is a modified version of the R code provided by Loser [259].
further inspection of the data indicates that policies with more than 5 claims may be
data errors because they all seem to belong to the same driver (and they have very
short exposures).⁴ For this reason we drop these records on line 12. On line 13 we
censor exposures at one accounting year (since these policies are active within one
calendar year). Finally, on lines 15-16 we re-level the VehBrands.⁵ All subsequent
analysis is based on this cleaned data set.


Listing 13.1 Data cleaning applied to the French MTPL data set 


library(CASdatasets)                      # provides freMTPL2freq and freMTPL2sev
data(freMTPL2freq)
dat <- freMTPL2freq[, -2]                 # drop the original ClaimNb column
dat$VehGas <- factor(dat$VehGas)
data(freMTPL2sev)
sev <- freMTPL2sev
sev$ClaimNb <- 1                          # each severity record counts as one claim
dat0 <- aggregate(sev, by=list(IDpol=sev$IDpol), FUN=sum)[c(1,3:4)]
names(dat0)[2] <- "ClaimTotal"
dat <- merge(x=dat, y=dat0, by="IDpol", all.x=TRUE)
dat[is.na(dat)] <- 0                      # policies without severity records have no claims
dat <- dat[which(dat$ClaimNb <= 5),]      # drop policies with more than 5 claims
dat$Exposure <- pmin(dat$Exposure, 1)     # censor exposures at one year
sev <- sev[which(sev$IDpol %in% dat$IDpol), c(1,2)]
dat$VehBrand <- factor(dat$VehBrand, levels=c("B1","B2","B3","B4","B5","B6",
                       "B10","B11","B12","B13","B14"))


Listing 13.2 Excerpt of the French MTPL data set 


‘data.frame’: 678007 obs. of 13 variables:
 $ IDpol     : num 1 3 5 10 11 13 15 17 18 21 ...
 $ Exposure  : num 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 $ Area      : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
 $ VehPower  : int 5 5 6 7 7 6 6 7 7 7 ...
 $ VehAge    : int 0 0 2 0 0 2 2 0 0 0 ...
 $ DrivAge   : int 55 55 52 46 46 38 38 33 33 41 ...
 $ BonusMalus: int 50 50 50 50 50 50 50 68 68 50 ...
 $ VehBrand  : Factor w/ 11 levels "B1","B2","B3",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ VehGas    : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
 $ Density   : int 1217 1217 54 76 76 3003 3003 137 137 60 ...
 $ Region    : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 ...
 $ ClaimTotal: num 0 0 0 0 0 0 0 0 0 0 ...
 $ ClaimNb   : num 0 0 0 0 0 0 0 0 0 0 ...

‘data.frame’: 26383 obs. of 2 variables:
 $ IDpol      : int 1552 1010996 4024277 4007252 4046424 4073956 4012173 ...
 $ ClaimAmount: num 995 1128 1851 1204 1204 ...


Listing 13.2 gives an excerpt of the cleaned French MTPL data set, lines 2- 
14 give the insurance policy and claim counts information, and lines 17-18 


4 Short exposure policies may also belong to a commercial car rental company.

5 The data set FreMTPLfreq of CASdatasets is a subset of FreMTPL2freq with slightly
changed feature components; for instance, the former data set contains car brand names in a more
aggregated version than the latter, see Table 13.2, below.
display the individual claim amounts. We have 9 feature components on lines 4-12
(1 component is binary, 3 components are categorical, and 5 components are
continuous), an exposure variable on line 3, and claim information on lines 13-14
and 18. In total we have 26’383 claims on 678’007 insurance policies.

We start by giving a descriptive analysis of the data, this closely follows Noll et 
al. [287]. We have the following insurance policy information: 


1. IDpol: policy number (unique identifier);
2. Exposure: total exposure in yearly units (years-at-risk) and within (0, 1];
3. Area: area code (categorical, ordinal with 6 levels);
4. VehPower: power of the car (continuous);
5. VehAge: age of the car in years;
6. DrivAge: age of the (most common) driver in years;
7. BonusMalus: bonus-malus level between 50 and 230 (with entrance level
100);
8. VehBrand: car brand (categorical, nominal with 11 levels), see also
Table 13.2;
9. VehGas: diesel or regular fuel car (binary);
10. Density: density of population per km² at the location of the living place of
the driver;
11. Region: regions in France (prior to 2016), see also Fig. 13.1 (categorical).


We start by describing the Exposure. The Exposure measures the duration of
an insurance policy in yearly units; sometimes it is also called years-at-risk. The
shortest exposure in our data set is 0.0027 which corresponds to 1 day, and the
longest exposure is 1 which corresponds to 1 year. Figure 13.2 (lhs, middle) shows
a histogram and a boxplot of these exposures. In view of the histogram we conclude
that roughly 1/4 of all policies have a full exposure of 1 calendar year, and all
other policies are only partly exposed during the calendar year. From a practical
insurance point of view this high ratio of partly exposed policies seems rather


Fig. 13.1 The 22 regions in France between 1982 and 2015

Fig. 13.2 (lhs) Histogram of Exposure, (middle) boxplot of Exposure, (rhs) number of
observed claims ClaimNb of the French MTPL data

Table 13.1 Split of the portfolio w.r.t. the number of claims

Number of claims   | 0       | 1      | 2     | 3  | 4 | 5
Number of policies | 653’069 | 23’571 | 1’298 | 62 | 5 | 2
Total exposure     | 341’090 | 16’315 | 909   | 42 | 2 | 1

unusual. A further inspection of the data indicates that policy renewals during the
year account for two separate records in the data set. Of course, such split policies
should be merged to one yearly policy. Unfortunately, we do not have the necessary
information to perform this merger; therefore, we need to work with the data as it is.
In Table 13.1 and Fig. 13.2 (rhs) we split the portfolio w.r.t. the number of claims.
On 653’069 insurance policies (amounting to a total exposure of 341’090 years-at-risk)
we do not have any claim, and on the remaining 24’938 policies (17’269
years-at-risk) we have at least one claim. The overall portfolio claim frequency
(w.r.t. Exposure) is $\bar\lambda$ = 7.35%.
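
The portfolio totals quoted above can be checked directly from the figures of Table 13.1. The following short R sketch only uses the tabulated numbers; since the tabulated exposures are rounded to full years-at-risk, the resulting frequency of roughly 7.36% agrees with the reported 7.35% only up to this rounding.

```r
# Figures of Table 13.1: policies and total exposure by number of claims
claims   <- 0:5
policies <- c(653069, 23571, 1298, 62, 5, 2)
exposure <- c(341090, 16315, 909, 42, 2, 1)    # rounded years-at-risk

n_policies  <- sum(policies)                   # total number of policies
n_claims    <- sum(claims * policies)          # total number of claims
with_claims <- sum(policies[claims > 0])       # policies with at least one claim
expo_claims <- sum(exposure[claims > 0])       # their years-at-risk
lambda_bar  <- n_claims / sum(exposure)        # overall claim frequency
round(100 * lambda_bar, 2)
```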

We study the split of this overall frequency $\bar\lambda$ = 7.35% across the different
feature levels. This empirical analysis is crucial for the model choice in regression
modeling. For the empirical analysis we provide 3 different types of graphs for each
feature component (where applicable); these are given in Figs. 13.3 to 13.11.
The first graph (lhs) gives the split of the total
exposure to the different feature levels, the second graph (middle) gives the average
feature value in each French region (green meaning low and red meaning high),⁷
and the third graph (rhs) gives the observed average frequency per feature level. This
observed frequency is obtained by dividing the total number of claims by the total
exposure per feature level. The frequencies are complemented by confidence bounds
(shaded area) of twice the estimated standard deviations.
The standard deviations are estimated under


6 The empirical analysis in these notes differs from Noll et al. [287] because data cleaning has been 
done differently here, we refer to Listing 13.1. 

7 We acknowledge the use of UNESCO (1987) database through UNEP/GRID-Geneva for the 
French map. 
Fig. 13.3 (lhs) Histogram of exposures per Area code, (middle) average Area code per
Region, we map (A, ..., F) ↦ (1, ..., 6), (rhs) observed frequency per Area code

Fig. 13.4 (lhs) Histogram of exposures per VehPower, (middle) average VehPower per
Region, (rhs) observed frequency per VehPower

Fig. 13.5 (lhs) Histogram of exposures per VehAge (censored at 20), (middle) average VehAge
per Region, (rhs) observed frequency per VehAge

Fig. 13.6 (lhs) Histogram of exposures per DrivAge (censored at 90), (middle) average
DrivAge per Region, (rhs) observed frequency per DrivAge (y-scale is different compared
to the other frequency plots)

Fig. 13.7 (lhs) Histogram of exposures per BonusMalus level (censored at 150), (middle)
average BonusMalus level per Region, (rhs) observed frequency per BonusMalus level (y-scale
is different compared to the other frequency plots)


a Poisson assumption; thus, they are obtained by $\pm 2\sqrt{\bar\lambda_k/\mathrm{Exposure}_k}$, where
$\bar\lambda_k$ is the observed frequency and $\mathrm{Exposure}_k$ is the total exposure for a given
feature level $k$. We note that in all frequency plots the y-axis ranges from 0% to
20%, except in the BonusMalus plot where the maximum is set to 60%, and the
DrivAge plot where the maximum is set to 40%. From these plots we conclude
that some levels have only a small underlying Exposure; BonusMalus leads to
the highest variability in frequencies followed by DrivAge; and there is quite some
heterogeneity.
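
The confidence bounds used in the frequency plots can be sketched as follows. This is an illustrative R function on a small made-up table; the helper name `poisson_bounds`, the toy numbers, and the truncation of the lower bound at zero are assumptions of this sketch and not taken from the text. Under a Poisson assumption the observed frequency on a level with total exposure $v$ has variance $\lambda/v$, which gives the estimated standard deviation $\sqrt{\bar\lambda_k/\mathrm{Exposure}_k}$.

```r
# Observed frequency and +/- 2 standard deviation bounds per feature level
poisson_bounds <- function(claims, exposure) {
  lambda <- claims / exposure            # observed frequency per level
  sdev   <- sqrt(lambda / exposure)      # estimated standard deviation (Poisson)
  data.frame(freq  = lambda,
             lower = pmax(lambda - 2 * sdev, 0),   # truncated at 0 (sketch choice)
             upper = lambda + 2 * sdev)
}

# hypothetical feature levels with their total claims and total exposures
toy <- data.frame(level = c("A", "B", "C"),
                  claims = c(40, 9, 1), exposure = c(400, 100, 4))
res <- cbind(toy, poisson_bounds(toy$claims, toy$exposure))
res
```

Level C illustrates the remark on small exposures: with only 4 years-at-risk the bounds become so wide that the observed frequency is hardly informative.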

Table 13.2 gives the assignment of the different VehBrand levels to car
brands. This list has been compiled from the two data sets FreMTPLfreq
and FreMTPL2freq contained in the R package CASdatasets [113], see
Footnote 5.

Next, we analyze collinearity between the feature components. For this we calculate 
Pearson’s correlation and Spearman’s Rho for the continuous feature components, 
see Table 13.3. In general, these correlations are low, except for DrivAge 
vs. BonusMalus. Of course, the latter is very sensible because a BonusMalus 
Fig. 13.8 (lhs) Histogram of exposures per VehBrand, (rhs) observed frequency per
VehBrand; for VehBrand assignment we refer to Table 13.2

Fig. 13.9 (lhs) Histogram of exposures per VehGas, (middle) average VehGas per Region
(diesel is green and regular red), (rhs) observed frequency per VehGas

Fig. 13.10 (lhs) Histogram of exposures per population Density (on log-scale), (middle)
average population Density per Region, (rhs) observed frequency per population Density;
in general, we always consider Density on the log-scale

Fig. 13.11 (lhs) Histogram of exposures Exposure, and (middle, rhs) observed claim frequencies
per Region in France (prior to 2016)


Table 13.2 VehBrand assignment

B1 / B2   | Renault, Nissan and Citroën
B3        | Volkswagen, Audi, Skoda and Seat
B4 / B5   | Opel, General Motors and Ford
B6        | Fiat
B10 / B11 | Mercedes, Chrysler and BMW
B12       | Japanese (except Nissan) and Korean cars
B13 / B14 | Other cars


Table 13.3 Correlations in feature components: top-right shows Pearson’s correlation; bottom-left
shows Spearman’s Rho; Density is considered on the log-scale; significant correlations are
boldface

           | VehPower | VehAge | DrivAge | BonusMalus | Density
VehPower   |          |        |         |            | 0.01
VehAge     |          |        |         |            | −0.10
DrivAge    |          |        |         |            | −0.05
BonusMalus | −0.07    | 0.08   | −0.57   |            | 0.13
Density    |          |        |         |            |


level below 100 needs a certain number of driving years without claims. We give the
corresponding boxplot in Fig. 13.12 (lhs) which confirms this negative correlation.
Figure 13.12 (rhs) gives the boxplot of log-Density vs. Area code. From this
plot we conclude that the area code has likely been set w.r.t. the log-Density.
For our regression models this means that we can drop the area code information,
and we should only work with Density. Nevertheless, we will use the area code
to show what happens in case of collinear feature components, i.e., if we replace
(A, ..., F) ↦ (1, ..., 6).
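
A correlation table with the layout of Table 13.3 (Pearson's correlation in the top-right, Spearman's Rho in the bottom-left) can be assembled as in the following R sketch. The data frame `df` is a small made-up example and not the French MTPL data; the first lines illustrate on a toy vector why the two measures may differ, namely that Spearman's Rho only depends on ranks.

```r
# Monotone but non-linear relationship: Spearman's Rho is 1, Pearson's is smaller
x <- 1:10
y <- exp(x)
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Hypothetical continuous features (toy numbers for illustration only)
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(2, 1, 4, 3, 6),
                 c = c(5, 3, 4, 1, 2))
P <- cor(df, method = "pearson")
S <- cor(df, method = "spearman")

# Top-right: Pearson's correlation; bottom-left: Spearman's Rho
tab <- P
tab[lower.tri(tab)] <- S[lower.tri(S)]
diag(tab) <- NA
round(tab, 2)
```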

Figure 13.13 illustrates each continuous feature component w.r.t. the different 
VehBrands. Vehicle brands B10 and B11 (Mercedes, Chrysler and BMW) have 
more VehPower than other cars, B10 being more likely a diesel car, and vehicle 
brand B12 (Japanese and Korean cars) has comparably new cars in more densely 
populated French regions. 
Fig. 13.12 Boxplots (lhs) BonusMalus vs. DrivAge, (rhs) log-Density vs. Area code;
these plots are inspired by Fig. 2 in Lorentzen–Mayer [258]


More formally, the strength of dependence between categorical variables can be
measured by Cramér’s V. Cramér’s V is based on the $\chi^2$-test of independence
on contingency tables. We briefly explain this. Assume we have two-dimensional
categorical features $x = (x_1, x_2) \in \mathcal{X}$ having $m_1$ and $m_2$ levels, respectively. Let $p_x$
describe the probability on $\mathcal{X}$ that a randomly chosen insurance policy takes feature
$x$, and let $p_{x_1}$ and $p_{x_2}$ be the marginal distributions of $p_x$. If the two components of
$x$ are independent with these two marginals, then we have the special (independence)
distribution

$$\pi_x = p_{x_1}\, p_{x_2} \qquad \text{for all } x = (x_1, x_2) \in \mathcal{X}.$$

The $\chi^2$-test for independence now analyzes $p_x$ vs. $\pi_x$. Assume we have $n$
observations. Denote by $n_x = n_{x_1,x_2}$ the number of instances that have feature
$x = (x_1, x_2)$, and let $n_{x_1,\bullet}$ and $n_{\bullet,x_2}$ be the corresponding marginal observations.
The $\chi^2$-test statistic is given by

$$\chi^2 = \sum_{x=(x_1,x_2)\in\mathcal{X}} \frac{\left(n_{x_1,x_2} - n_{x_1,\bullet}\, n_{\bullet,x_2}/n\right)^2}{n_{x_1,\bullet}\, n_{\bullet,x_2}/n}.$$

Under the null hypothesis of having independence between the components of $x$,
the test statistic $\chi^2$ converges in distribution to a $\chi^2$-distribution with $(m_1 - 1)(m_2 - 1)$
degrees of freedom if we let the number of independently drawn instances go to
infinity. Seven different proofs of this statement are given in Benhamou–Melot [30].
Fig. 13.13 Distribution of the variables VehPower, VehAge, DrivAge, BonusMalus, log-Density,
VehGas for each car brand VehBrand, individually

Table 13.4 Cramér’s V for the categorical feature components vs. the categorized continuous
components

Fig. 13.14 VehBrands in the different French Regions


We scale the test statistic to the interval [0, 1] by dividing it by the comonotonic
(maximal dependent) case and by the sample size $n$. This motivates Cramér’s V

$$V = \sqrt{\frac{\chi^2/n}{\min\{m_1 - 1,\, m_2 - 1\}}} \in [0, 1].$$


Section 7.2.3 of Cohen [78] gives a rule of thumb for small, medium
and large dependence. Cohen [78] calls the association between $x_1$ and $x_2$
small if $V\sqrt{\min\{m_1 - 1, m_2 - 1\}}$ is less than 0.1, it is of medium strength for
$V\sqrt{\min\{m_1 - 1, m_2 - 1\}}$ of size 0.3, and it is a large effect if this value is around
0.5. Our results are presented in Table 13.4. Clearly, there is some association
between VehBrand and both VehPower and VehAge; this can also be seen
from Fig. 13.13. For the remaining variables the dependence is somewhat weaker.
Not surprisingly, Cramér’s V shows the largest value between Region and log-Density.
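
The $\chi^2$-test statistic, Cramér's V and the rescaled quantity used in Cohen's rule of thumb can be computed from a contingency table as in the following R sketch. The function name `cramers_v` and the 2×2 toy table are assumptions made for illustration; the sketch requires that all expected cell counts are strictly positive.

```r
# Chi-squared statistic, Cramer's V and V * sqrt(min{m1 - 1, m2 - 1})
cramers_v <- function(tab) {
  n    <- sum(tab)
  ex   <- outer(rowSums(tab), colSums(tab)) / n   # expected counts under independence
  chi2 <- sum((tab - ex)^2 / ex)
  m    <- min(nrow(tab), ncol(tab)) - 1
  V    <- sqrt(chi2 / n / m)
  c(chi2 = chi2, V = V, cohen = V * sqrt(m))
}

# hypothetical contingency table of two binary features
tab <- matrix(c(30, 10, 20, 40), nrow = 2)
res <- cramers_v(tab)
res

# cross-check against the base R test without continuity correction
chi2_base <- unname(chisq.test(tab, correct = FALSE)$statistic)
```

For two binary features the rescaling factor is 1, so V and Cohen's quantity coincide; for the VehBrand vs. Region example with $\min\{m_1 - 1, m_2 - 1\} = 10$ the rescaling multiplies V by $\sqrt{10} \approx 3.16$.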

In Fig. 13.14 we show the VehBrands in the different French Regions. Cramér’s
V is 0.13 for these two categorical variables; multiplying by $\sqrt{11 - 1}$ gives a
value bigger than 0.4, which is a considerable association according to Cohen [78].
We note that in some regions the French car brands B1 and B2 are very dominant,
whereas on the island of Corsica (R94) 80% of the cars in our portfolio are Japanese
Fig. 13.15 Empirical density and log-log plots of the observed claim amounts


or Korean cars B12. Our portfolio has its biggest exposure in Region R24, see 
Fig. 13.11, in this region French cars are predominant. 

Next, we study the claim sizes of this French MTPL example. Figure 13.15 shows 
the empirical density plot and the log-log plot. These two plots already illustrate the 
main difficulty we often face in claim size modeling. From the empirical density 
plot we observe that there are many payments of fixed size (red vertical lines) which 
do not match any absolutely continuous distribution function assumption. The log- 
log plot shows heavy-tailedness because we observe asymptotically a straight line 
with negative slope on the log-scale, this indicates regularly varying tails and, thus, 
the EDF is not a suitable model on the original observation scale. 
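
The reading of a log-log plot can be illustrated with a small R sketch. A log-log plot shows the logarithm of the empirical survival function against the logarithm of the claim size; for a Pareto tail with tail index $\alpha$ this gives a straight line with slope $-\alpha$. The sketch below uses exact Pareto quantiles as synthetic "claims" (not the French MTPL claim amounts), and the tail index `alpha = 2` is a hypothetical choice.

```r
alpha <- 2                          # hypothetical Pareto tail index
n     <- 1000
u     <- ((1:n) - 0.5) / n          # plotting positions in (0, 1)
x     <- (1 - u)^(-1 / alpha)       # Pareto quantiles (threshold 1), increasing in u

# empirical survival function evaluated at the (sorted) observations
surv  <- 1 - (rank(x) - 0.5) / n

# slope of the log-log plot; plot(log(x), log(surv)) would display the line
fit   <- lm(log(surv) ~ log(x))
slope <- unname(coef(fit)[2])
slope
```

On real claim amounts the straight-line behavior is only expected asymptotically, i.e., for the largest observations, so in practice one inspects the right end of the plot.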

Figure 13.16 gives the boxplots of the claim sizes per feature level (we omit the
claims outside the whiskers because heavy-tailedness would distort the picture). The
empirical mean in orange is much bigger than the median in red, which also
expresses the heavy-tailedness. From these plots we conclude that the claim sizes
seem less sensitive to feature values, which may question the use of a regression
model for claim sizes.

Figure 13.17 shows the density plots for different feature levels. Interestingly, it 
seems that the features determine the sizes of the modes, for instance, if we focus 
on Area, Fig. 13.17 (top-left), we see that the area codes mainly influence the sizes 
of the modes. This may be interpreted by modes corresponding to different claim 
types which occur at different frequencies among the area codes. 


13.2 Swedish Motorcycle Data 


Our second example considers the Swedish motorcycle data which originally 
has been used in Ohlsson-Johansson [290]. It is available through the R library 
Fig. 13.16 Boxplots of claim sizes per feature level: these plots omit the claims outside the
whiskers; red color shows the median and orange color the empirical mean


CASdatasets [113], and it is called swmotorcycle. Listing 13.3 shows the
data cleaning that we have used, and Listing 13.4 gives an excerpt of the cleaned
data.


We briefly describe the data. The data considers comprehensive insurance for
motorcycles. This covers loss or damage of motorcycles other than collision, e.g.,
caused by theft, fire or vandalism. The data considers aggregated claims on feature
levels for years 1994-1998. We have claims on 656 out of the 62’036 different
features; thus, only slightly more than 1% of all feature combinations suffer a claim
in the considered period.
Fig. 13.17 Empirical claim size densities split w.r.t. the different levels of the feature components


We start by describing the available variables on lines 2-10 of Listing 13.4:

1. OwnerAge: age of motorcycle owner in {18, ..., 70} years (we censor at 70
because of scarcity of data above);
2. Gender: gender of motorcycle owner either being Female or Male;
3. Area: 7 geographical Swedish zones being (1) central parts of Sweden’s three
largest cities, (2) suburbs and middle-sized towns, (3) lesser towns except those
in zones (5)-(7), (4) small towns and countryside except those in zones (5)-(7),
(5) Northern towns, (6) Northern countryside, and (7) Gotland (Sweden’s largest
island);
4. RiskClass: 7 ordered motorcycle classes received from the so-called EV ratio
defined as (Engine power in kW × 100) / (Vehicle weight in kg + 75 kg);
5. VehAge: age of motorcycle in {0, ..., 30} years (we censor at 30);
6. BonusClass: ordered bonus-malus class from 1 to 7, entry level is 1;
Listing 13.3 Data cleaning applied to the Swedish motorcycle data set


library(CASdatasets)
data(swmotorcycle)
mcdata <- swmotorcycle
mcdata$Gender <- as.factor(mcdata$Gender)
mcdata$Area <- as.factor(mcdata$Area)
mcdata$Area <- factor(mcdata$Area, levels(mcdata$Area)[c(1,7,3,6,5,4,2)])
mcdata$Area <- c("Zone 1","Zone 2","Zone 3","Zone 4","Zone 5",
                 "Zone 6","Zone 7")[as.integer(mcdata$Area)]
mcdata$Area <- as.factor(mcdata$Area)
mcdata$RiskClass <- as.factor(mcdata$RiskClass)
mcdata$RiskClass <- factor(mcdata$RiskClass,
                           levels(mcdata$RiskClass)[c(1,6,7,3,4,5,2)])
mcdata$RiskClass <- as.integer(mcdata$RiskClass)
mcdata$BonusClass <- as.integer(as.factor(mcdata$BonusClass))
#
mcdata <- mcdata[which(mcdata$OwnerAge>=18),]    # only minimal age 18
mcdata$OwnerAge <- pmin(70, mcdata$OwnerAge)     # set maximal age 70
mcdata$VehAge <- pmin(30, mcdata$VehAge)         # set maximal motorcycle age 30
mcdata <- mcdata[which(mcdata$Exposure>0),]      # only positive exposures


Listing 13.4 Excerpt of the Swedish motorcycle data set 


‘data.frame’: 62036 obs. of 9 variables:
 $ OwnerAge   : num 18 18 18 18 18 18 18 18 18 18 ...
 $ Gender     : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
 $ Area       : Factor w/ 7 levels "Zone 1","Zone 2",..: 1 1 1 1 2 2 2 3 ...
 $ RiskClass  : int 1 2 3 3 1 1 3 1 1 1 ...
 $ VehAge     : num 8 11 9 9 11 12 24 4 6 6 ...
 $ BonusClass : int 2 2 3 4 1 1 2 1 1 2 ...
 $ Exposure   : num 1 0.778 0.499 0.501 0.929 ...
 $ ClaimNb    : int 0 0 0 0 0 0 0 0 0 0 ...
 $ ClaimAmount: int 0 0 0 0 0 0 0 0 0 0 ...


7. Exposure: total exposure in yearly units, these exposures are aggregated for
given feature combinations, resulting in total exposures [0.0274, 31.3397], the
shortest entry referring to 10 days and the longest one to more than 31 years;

8. ClaimNb: number of claims $N_i$ for a given feature;

9. ClaimAmount: total claim amount for a given feature (aggregated over all
claims).


We start with a descriptive and exploratory analysis of the Swedish motorcycle
data of Listing 13.4. We have $n$ = 62’036 different feature combinations with
positive Exposure. This Exposure is aggregated over individual policies with a
fixed feature combination. We denote by $N_i$ the number of claims on feature $i$, this
corresponds to ClaimNb, and the total claim amount ClaimAmount is denoted
by $S_i = \sum_{j=1}^{N_i} Z_{i,j}$, where $Z_{i,j}$ are the individual claim sizes on feature $i$ (in case
of claims). Writing $v_i$ for the Exposure of feature $i$, the empirical claim frequency is
$\bar\lambda = \sum_{i=1}^n N_i / \sum_{i=1}^n v_i$ = 1.05%, and
the average claim size is $\bar\mu = \sum_{i=1}^n S_i / \sum_{i=1}^n N_i$ = 24’641 Swedish crowns SEK.
Fig. 13.18 (lhs) Boxplot of Exposure on the log-scale (the horizontal line corresponds to 1
accounting year), (rhs) histogram of the number of observed claims ClaimNb per feature of the
Swedish motorcycle data


Figure 13.18 shows the boxplot over all Exposures and the claim counts on all
insurance policies. We note that insurance claims are rare events for this product,
because the empirical claim frequency is only $\bar\lambda$ = 1.05%.

Figures 13.19 and 13.20 give the marginal total exposures (split by gender), the
marginal claim frequencies and the marginal average claim amounts for the covariate
components OwnerAge, Area, RiskClass, VehAge and BonusClass.
We observe that the portfolio is very imbalanced between genders: only 11%
of the total exposure comes from females. The empirical claim frequency of
females is 0.86% and that of males is 1.08%. We note that the female claim
frequency is based on (only) 61 claims (from a female exposure of 7’094
accounting years, versus 57’679 for males). Therefore, it is difficult to analyze
females separately, and all marginal claim frequencies and claim sizes in Figs. 13.19
and 13.20 (middle and rhs) are analyzed jointly for both genders. If we run a simple
Poisson GLM that only involves Gender as feature component, it turns out that
the female frequency is 20% lower than the male frequency (remember that we have
the balance property on each dummy variable, see Example 5.12), but this variable
should not be kept in the model at a 5% significance level. The same holds for claim
amounts.
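
The gender figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text; the male frequency is available only as a rounded percentage, so the results are approximate, and the choice of males as reference level is our assumption.

```python
import math

# Figures quoted in the text for the Swedish motorcycle data.
female_exposure = 7094.0   # accounting years
male_exposure = 57679.0    # accounting years
female_claims = 61
male_frequency = 0.0108    # quoted only as a rounded percentage

female_share = female_exposure / (female_exposure + male_exposure)
female_frequency = female_claims / female_exposure

# With only a Gender dummy and the log-link, the balance property forces the
# fitted frequencies to equal the empirical ones per gender, so the fitted
# relativity female/male is the ratio of the two empirical frequencies.
relativity = female_frequency / male_frequency
# Implied regression parameter of the female dummy (assumes male as reference).
gender_coefficient = math.log(relativity)

print(f"female exposure share: {female_share:.1%}")
print(f"female claim frequency: {female_frequency:.2%}")
print(f"female relative to male frequency: {relativity - 1.0:+.0%}")
```

This reproduces the statement that the female frequency is roughly 20% below the male one; it says nothing about significance, which requires the standard error of the fitted parameter.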

The empirical marginal frequencies in Figs. 13.19 and 13.20 (middle) are
complemented with confidence bands of ±2 standard deviations. From the plots
we conclude that we should keep the explanatory variables OwnerAge, Area,
RiskClass and VehAge, but the variable BonusClass does not seem to have
any predictive power. At first sight, this seems surprising because the bonus class
encodes the past claims history. The reason that the bonus class is not needed for our
claims is that we consider comprehensive insurance for motorcycles covering loss
or damage of motorcycles other than collision (for instance, caused by theft, fire or
vandalism), and the bonus class encodes collision claims.
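
The text does not spell out how the standard deviations of the confidence bands are estimated. A common choice, which we assume here, is a Poisson model for the claim counts, giving $\mathrm{Var}(N/v) = \lambda/v$ for exposure $v$. The function name below is hypothetical, and the female totals from the text only serve as an illustration.

```python
import math


def frequency_band(claims, exposure, width=2.0):
    """Empirical frequency with a band of +/- `width` standard deviations.

    Assumption: claim counts are Poisson, so Var(N / v) = lambda / v, and
    lambda is replaced by its estimate N / v.
    """
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    frequency = claims / exposure
    std = math.sqrt(frequency / exposure)
    lower = max(frequency - width * std, 0.0)   # a frequency cannot be negative
    upper = frequency + width * std
    return lower, frequency, upper


# Illustration with the female totals quoted in the text.
lower, freq, upper = frequency_band(claims=61, exposure=7094.0)
print(f"{lower:.2%} <= {freq:.2%} <= {upper:.2%}")
```

Under this assumption the band width shrinks with the square root of the exposure, which is why levels with small exposures show wide bands.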

Fig. 13.19 (Top, middle and bottom rows) OwnerAge, Area, RiskClass: (lhs) histogram of
exposures (split by gender), (middle) observed claim frequency, (rhs) boxplot of observed average
claim amounts $\bar{\mu}_i = S_i/N_i$ of features with $N_i > 0$ (on log-scale)


For a regression analysis, Zones 5 to 7 of Area should be merged because of their
small exposures and similar behavior; the same applies to RiskClass 6 and 7, and
to VehAge above 20.
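
The merging described above can be sketched as a simple recoding of one record. We assume that Area and RiskClass are encoded as integers 1 to 7 and VehAge as an integer number of years; the function name, the merged labels and the code 21 for "above 20" are hypothetical choices.

```python
def merge_levels(area, risk_class, veh_age):
    """Coarsen Area, RiskClass and VehAge of one record before regression."""
    merged_area = "5-7" if area in (5, 6, 7) else str(area)
    merged_risk = "6-7" if risk_class in (6, 7) else str(risk_class)
    # All vehicle ages above 20 share one class, coded here as 21.
    merged_veh_age = veh_age if veh_age <= 20 else 21
    return merged_area, merged_risk, merged_veh_age


print(merge_levels(area=6, risk_class=7, veh_age=35))
print(merge_levels(area=1, risk_class=3, veh_age=20))
```

Merging reduces the number of dummy variables and avoids estimating parameters from levels with very little exposure.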

Figure 13.21 shows the dependence between the features: (top) correlations between
continuous features, (bottom) dependence between the continuous features and the
categorical Area feature. We observe some dependence; for instance, in Zone 1
(the three largest Swedish cities) the motorcycles are lighter (RiskClass) and
newer. Older people drive lighter and older motorcycles, and older
motorcycles are lighter.

Figure 13.22 gives the empirical density, empirical distribution and log-log plot of
average claim amounts $\bar{\mu}_i = S_i/N_i$. From the log-log plot we conclude that the
average claim amounts are not heavy-tailed for this motorcycle insurance product.
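
A log-log plot displays the logarithm of the empirical survival function against the logarithm of the claim amount. The sketch below computes these coordinates; the function name is hypothetical, and the convention $P(X \ge x_{(k)}) = (n-k+1)/n$ for the $k$-th smallest observation is our assumption, chosen to keep all logarithms finite.

```python
import math


def log_log_points(claims):
    """Coordinates (log x, log empirical survival) for a log-log plot."""
    xs = sorted(x for x in claims if x > 0)   # the logarithm needs positive amounts
    n = len(xs)
    # enumerate starts at 0, so the k-th smallest value (k = 1, ..., n)
    # receives the survival probability (n - k + 1) / n.
    return [(math.log(x), math.log((n - idx) / n)) for idx, x in enumerate(xs)]


# Hypothetical toy claim amounts (not the real data).
points = log_log_points([100.0, 1000.0, 10.0, 0.0])
for log_x, log_survival in points:
    print(f"{log_x:.3f}  {log_survival:.3f}")
```

For heavy-tailed (regularly varying) data the points decrease roughly linearly for large claims, whereas for lighter tails the curve bends downwards.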

Fig. 13.20 (Top and bottom rows) VehAge, BonusClass: (lhs) histogram of exposures (split
by gender), (middle) observed claim frequency, (rhs) boxplot of observed average claim amounts
$\bar{\mu}_i = S_i/N_i$ of features with $N_i > 0$ (on log-scale)


13.3 Wisconsin Local Government Property Insurance Fund 


The third example considers property insurance claims of the Wisconsin Local
Government Property Insurance Fund (LGPIF). This data⁸ has been made available
through the book project of Frees [135],⁹ and it is also used in Lee et al. [236]. The
Wisconsin LGPIF is an insurance pool that is managed by the Wisconsin Office
of the Insurance Commissioner. This fund provides insurance protection to local
governmental institutions such as counties, schools, libraries, airports, etc. It insures
property claims for buildings and motor vehicles, and it excludes certain natural and
man-made perils like flood, earthquakes or nuclear accidents. We give a description
of the data (we have applied some data cleaning to the original data).

The special feature of this data is that we have a short claim description on line 11 
of Listing 13.5. This description will allow us to better understand the claim type 
beyond just knowing the hazard type that has been affected. 

Figure 13.23 gives the empirical density (upper-truncated at 50’000) and the log-log
plot of the observed LGPIF claim amounts. Most claims are below 10’000; however,
the log-log plot clearly shows that the data is heavy-tailed, the largest claim being
12’922’218 and 13 claims being above 1 million. These claims are further described
by the features given in Listing 13.5.


8 https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data.
9 https://ewfrees.github.io/Loss-Data-Analytics/.

Fig. 13.21 (Top) Correlations: top-right shows Pearson’s correlation; bottom-left shows
Spearman’s Rho; (bottom) boxplots of OwnerAge, RiskClass, VehAge versus Area (where Zones
5-7 have been merged)

Fig. 13.22 (lhs) Empirical density, (middle) empirical distribution and (rhs) log-log plot of average
claim amounts $\bar{\mu}_i = S_i/N_i$ of features with $N_i > 0$

In our example we will not focus on modeling the claim sizes, but we rather
aim at predicting the hazard types from the claim descriptions. There are 9 different
hazard types: Fire, Lightning, Hail, Wind, WaterW, WaterNW, Vehicle, Vandalism
and Misc. The last label contains all claims that cannot be allocated to one of
the previous hazard types, and WaterW refers to weather related water claims and
WaterNW to the non-weather related ones. If we only focus on this latter problem
we have more data available as there is a training data set and a validation data
set with hazard types and claim descriptions.¹⁰ In total we have 6’031 such claim
descriptions, see Listing 13.6, which are studied in our text recognition Chap. 10.

Fig. 13.23 (lhs) Empirical density (upper-truncated at 50’000), (rhs) log-log plot of the observed
LGPIF claim amounts


Listing 13.5 Excerpt of the Wisconsin LGPIF data set

'data.frame':   5424 obs. of  10 variables:
 $ PolicyNum   : int  120002 120003 120003 120003 120003 120003 120003 ...
 $ Year        : int  2010 2007 2008 2007 2009 2010 2007 2007 2009 2007 ...
 $ Claim       : num  6839 2085 8775 600 34610 ...
 $ Deduct      : int  1000 5000 5000 5000 5000 5000 5000 5000 5000 5000 ...
 $ EntityType  : Factor w/ 6 levels "City","County",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ CoverageCode: Factor w/ 13 levels "CE","CF","CS",..: 12 12 11 11 11 12 ...
 $ Fire5       : int  4 0 0 0 0 0 0 0 0 0 ...
 $ CountyCode  : Factor w/ 72 levels "ADA","ASH","BAR",..: 2 3 3 3 3 3 3 3 ...
 $ Hazard      : Factor w/ 9 levels "Fire","Hail",..: 3 3 5 5 9 6 3 3 3 3 ...
 $ Description : chr  "lightning damage" "lightning damage at Comm. Center" ...



Listing 13.6 Excerpt of the Wisconsin LGPIF claim descriptions

'data.frame':   6031 obs. of  2 variables:
 $ Hazard     : Factor w/ 9 levels "Fire","Hail",..: 1 3 3 5 5 9 3 6 ...
 $ Description: chr  "fire damage at Town Hall" "lightning damage at water tower" ...


10 https://github.com/OpenActTexts/Loss-Data-Analytics/tree/master/Data. 


13.4 Swiss Accident Insurance Data


Our next example considers Swiss accident insurance data.¹¹ This data set is not
publicly available. Swiss accident insurance is compulsory for employees, i.e., by
law each employer has to sign an insurance contract to protect the employees against
accidents. This insurance cover includes both work and leisure accidents, and it
covers medical expenses and daily allowance. Listing 13.7 gives an excerpt of the
data. Line 3 indicates whether we have a workplace or a leisure accident, line
10 gives the medical expenses and line 12 shows the allowance expenses. In the
subsequent analysis we only consider medical expenses.


Listing 13.7 Excerpt of the Swiss accident insurance data set 


'data.frame':   339500 obs. of  11 variables:
 $ Id       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ BU       : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 2 2 2 1 ...
 $ Sector   : Factor w/ 24 levels "5","12","13",..: 5 10 13 7 12 13 4 2 1 1 ...
 $ AccQuart : int  3 2 1 3 4 4 1 2 1 3 ...
 $ RepDel   : num  0 0 0 0 1 0 0 0 0 0 ...
 $ Age      : num  45 20 20 20 60 55 30 25 20 20 ...
 $ InjType  : Factor w/ 19 levels "1","2","3","4",..: 76413162644...
 $ InjPart  : Factor w/ 35 levels "1","2","3","4",..: 20 28 28 20 14 23 2...
 $ Claim    : num  562 6675 700 57 2382 ...
 $ NumbPaym : num  2 2 2 1 1 3 1 1 1 1 ...
 $ Allowance: num  2345 5554 21 0 395 ...


Sector indicates the labor sector of the insured company, AccQuart gives the
accident quarter since leisure claims have a seasonal component, RepDel gives the
reporting delay in yearly units, Age is the age of the injured (in 5-year buckets),
and InjType and InjPart denote the injury type and the injured body part.

Figure 13.24 gives the empirical density (upper-truncated at 10’000) and the
log-log plot of the observed Swiss accident insurance claim amounts. Most claims are
below 5’000; however, the log-log plot shows some heavy-tailedness, the largest
claim exceeding 1’300’000 CHF.

Figure 13.25 shows the average claim amounts split w.r.t. the different feature
components (top) Sector, AccQuart, RepDel, (bottom) Age, InjType,
InjPart, and moreover, split by work and leisure accidents (in cyan and gray
in the colored version). Typically, leisure accidents are more numerous and more
expensive on average than accidents at the work place. From Fig. 13.25 (top, left)
we observe considerable variability in average claim sizes between the different
labor sectors (cyan bars), whereas average leisure claim sizes (gray bars) are similar
across the different labor sectors. Average claim sizes considerably differ between
injury types and injured body parts (bottom, middle and right), but they do not differ
between work and leisure claims.


11 https://www.unfallstatistik.ch/.

Fig. 13.24 (lhs) Empirical density (upper-truncated at 10’000), (rhs) log-log plot of the observed
Swiss accident insurance claim amounts

Fig. 13.25 Average claim amounts split w.r.t. the different feature components (top) Sector,
AccQuart, RepDel, (bottom) Age, InjType, InjPart, and split by work and leisure
accidents (cyan/gray in the colored version)
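
The split of average claim amounts by one feature component and by accident type can be sketched as a grouped average. The column names follow Listing 13.7; the function name and the toy records are hypothetical, and since the data set is not publicly available, nothing below reproduces actual figures.

```python
from collections import defaultdict


def average_claim_by_group(records, feature):
    """Average Claim for every observed pair (feature level, BU)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        key = (rec[feature], rec["BU"])
        sums[key] += rec["Claim"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical toy records (not the real data).
toy = [
    {"BU": "1", "AccQuart": 3, "Claim": 500.0},
    {"BU": "1", "AccQuart": 3, "Claim": 700.0},
    {"BU": "2", "AccQuart": 3, "Claim": 1000.0},
    {"BU": "2", "AccQuart": 1, "Claim": 200.0},
]
averages = average_claim_by_group(toy, "AccQuart")
for (level, bu), avg in sorted(averages.items()):
    print(f"AccQuart={level}, BU={bu}: {avg:.1f}")
```

Calling the same function with "Sector", "InjType" or "InjPart" as the feature gives the remaining panels; only pairs that actually occur in the data appear in the result.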

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons licence and 
indicate if changes were made. 